DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed November 11, 2025, has been entered. Claims 1, 3, 6-9, and 13 have been amended. Claims 2, 4-5, and 10-12 have been cancelled. Claims 1, 3, 6-9, and 13-14 are pending and have been examined. Applicant's amendments to the specification and claims have overcome all objections and all rejections under 35 U.S.C. 112 previously set forth in the non-final Office action mailed August 20, 2025.
Response to Arguments
Applicant's arguments with respect to claims 1, 3, 6-9, and 13-14 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3, 6-9, and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Srivastava et al. (US 2019/0090020 A1, hereinafter Srivastava), in view of Venkataraman et al. (U.S. Pat. No. 12,190,906 B1, hereinafter Venkataraman) and Rastrow et al. (U.S. Pat. Pub. No. 2021/0295833 A1, hereinafter Rastrow).
Regarding claim 1, Srivastava discloses a processing circuit of an electronic device (Srivastava, [0036]: "Examples of implementation of the calibration system 102 may include a projector, a smart television (TV), a personal computer, a special-purpose device, a media receiver, such as a set top box (STB), a digital media player, a micro-console, a game console, an (High Definition Multimedia Interface) HDMI compliant source device, a smartphone, a tablet computer, a personal computer, a laptop computer, a media processing system, or a calibration device."), comprising: an audio/video content generation circuit, configured to generate audio data and video data to a speaker and a display panel, respectively (Srivastava, [0037]: "The media device 110 may or may not include a display screen or a projection means, an audio device, and a set of input/output (I/O) devices. The media device 110 may be placed in a closed environment such that the playback of the set of media items through the media device 110 may lie in a field of audio-visual (FOA-V) reception of the audience 112. The media device 110 may comprise at least a first speaker to output an audio signal of the media item 104."); and a user hotspot detection circuit, configured to receive a microphone input from a microphone of the electronic device, and detect the microphone input to generate a user hotspot detection result when the speaker plays the audio data and the display panel shows the video data (Srivastava, [0039]: "The plurality of different types of sensors 114 may comprise suitable logic, circuitry, and interface that may be configured to capture a plurality of different types of input signals from an audience (e.g., the test audience 116 or the audience 112), at the playback of a media item (e.g., the media item 104). The captured plurality of different types of input signals may correspond to emotional response data associated with the audience 112 at the playback of the media item 104. The plurality of different types of sensors 114 may include the set of audio sensors 114A, the set of image sensors 114B, and the set of biometric sensors 114C."; [0040]: "The set of audio sensors 114A may be a set of microphones"); and an output circuit, configured to store the user hotspot detection result (Srivastava, [0050]: "With the initialization of the playback of the media item 104 at the media device 110, the plurality of different types of sensors 114 may be activated to capture a set of emotional responses continuously at a playback duration of the media item 104."); wherein the user hotspot detection circuit generates the user hotspot detection result according to the user emotion detection result and the VAD result (Srivastava, [0030]: "the control circuitry may be further configured to generate an amalgamated audience response signal for the media item. Such amalgamated audience response signals may be generated based on the synchronized plurality of different input signals and a plurality of weights assigned to the plurality of different types of sensors. The control circuitry may further identify a set of common positive peaks and a set of common negative peaks in each of the plurality of different types of input signals based on the overlay of the plurality of different types of input signals. A plurality of highlight points and a plurality of lowlight points may be further calculated by the control circuitry for a plurality of scenes of the media item, based on the identified set of common positive peaks and the set of common negative peaks.").
However, Srivastava fails to expressly recite wherein the user hotspot detection circuit comprises: an acoustic echo cancellation (AEC) circuit, configured to cancel or reduce an echo or the environment noise to generate a clean microphone input; an emotion detection circuit, configured to generate a user emotion detection result indicating which user emotion the clean microphone input corresponds to; and a voice activity detection (VAD) circuit, configured to detect if the clean microphone input comprises human voice or human speech, and to detect strength of the human voice or the human speech, to generate a VAD result, wherein the VAD result comprises information indicating at which level the strength of the human voice or the human speech is located.
Venkataraman teaches wherein the user hotspot detection circuit comprises: an acoustic echo cancellation (AEC) circuit, configured to cancel or reduce an echo or the environment noise to generate a clean microphone input (Venkataraman, Col. 16, lines 17-26: "Once each input has been received, for example by an emotion prediction system, the input may be processed by multimodal pre-processing 510. Such pre-processing may, depending on the type of input, include performing noise reduction in images or audio, correcting misspelled words (while preserving the misspellings for emotion prediction purposes), correcting grammatical errors (while preserving the grammatical errors for emotion prediction purposes), determining an endpoint of speech, and/or reducing background sounds in audio."); and an emotion detection circuit, configured to generate a user emotion detection result indicating which user emotion the clean microphone input corresponds to (Venkataraman, Col. 16, lines 40-46: "Using a trained machine learning model, the multimodal features, text converted from audio, and/or the context of the audio, may be utilized to determine an emotion, such as anger 516, disgust 518, fear 520, happiness 522, a neutral expression 524, sadness 526, surprise 528, and/or other emotions that a user (e.g., customer and/or agent) may experience.").
Srivastava and Venkataraman are analogous arts because they both belong to the same field of emotion detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the audience response capture system of Srivastava to incorporate the teachings of Venkataraman to clean an input and detect emotion from the input using Mel-scale frequency cepstral coefficient features. This allows a multimodal input to be used when detecting user emotions (Venkataraman, Col. 1, Brief Summary), enabling the system to improve user experiences by more effectively identifying and responding to emotions. However, Srivastava, in view of Venkataraman, fails to expressly recite a voice activity detection (VAD) circuit, configured to detect if the clean microphone input comprises human voice or human speech, and to detect strength of the human voice or the human speech, to generate a VAD result, wherein the VAD result comprises information indicating at which level the strength of the human voice or the human speech is located.
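As context for the echo-cancellation limitation discussed above, the following is a minimal sketch of a normalized-LMS adaptive filter, one conventional acoustic echo cancellation technique. It is illustrative only: it assumes a far-end loudspeaker reference signal is available, it is not drawn from Venkataraman (which describes the pre-processing only at a functional level), and all names and parameters are hypothetical.

```python
# Illustrative sketch only, not drawn from any cited reference: a
# normalized-LMS (NLMS) adaptive filter subtracting an echo estimate
# from the microphone signal. All parameters are hypothetical.
import numpy as np

def nlms_aec(mic: np.ndarray, far_end: np.ndarray,
             taps: int = 64, mu: float = 0.5, eps: float = 1e-8):
    """Subtract an adaptive estimate of the echo from the mic signal."""
    w = np.zeros(taps)                 # adaptive filter weights
    clean = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]  # most recent far-end samples
        echo_est = w @ x               # current echo estimate
        e = mic[n] - echo_est          # residual = presumed near-end speech
        w += mu * e * x / (x @ x + eps)  # NLMS weight update
        clean[n] = e
    return clean

# Usage: the mic picks up a delayed, attenuated copy of the speaker signal.
rng = np.random.default_rng(1)
far = rng.normal(0, 1, 4000)
mic = 0.4 * np.concatenate([np.zeros(10), far[:-10]]) + rng.normal(0, 0.01, 4000)
print(float(np.abs(nlms_aec(mic, far)[2000:]).mean()))  # residual shrinks
```

The residual signal e, left after the estimated echo is subtracted, corresponds to the "clean microphone input" the claim recites.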
Rastrow teaches a voice activity detection (VAD) circuit, configured to detect if the clean microphone input comprises human voice or human speech, and to detect strength of the human voice or the human speech, to generate a VAD result, wherein the VAD result comprises information indicating at which level the strength of the human voice or the human speech is located (Rastrow, [0198]: “The interrupt detector 1010 and/or the device directed classifier 1020 may determine whether device-directed speech is detected using at least one classifier. For example, the classifier may consider a variety of information, including a volume level of the speech (e.g., whether the speech is louder or quieter than normal speech), speaker identification (e.g., identification data) corresponding to the speech, which may be determined using voice recognition, image data, and/or other techniques known to one of skill in the art, emotion detection (e.g., emotion data) corresponding to the speech (e.g., whether the speech is animated or quiet, for example), a length of time between when the output audio begins and when the speech is detected, and/or the like without departing from the disclosure. The classifier may process the input information and generate model output data, which may indicate a likelihood that the speech is directed to the device.”; [0046]: “Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in audio data based on various quantitative aspects of the audio data, such as the spectral slope between one or more frames of the audio data; the energy levels of the audio data in one or more spectral bands; the signal-to-noise ratios of the audio data in one or more spectral bands; or other quantitative aspects.”).
Srivastava, Venkataraman, and Rastrow are analogous arts because they each belong to the same field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the audience response capture system of Srivastava, as modified by the emotion prediction system of Venkataraman, to incorporate the teachings of Rastrow to detect the strength of a human voice in the audio input. Detecting strength of a human voice allows the system to differentiate between voices the system is intended to pick up and voices that may be additional background noise (Rastrow, [0198]). This ensures that the system can determine if a human voice constitutes useful data, which improves the accuracy of the system.
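As context for the VAD limitation, the following is a minimal sketch of an energy-based voice activity detector that reports both speech presence and a quantized strength level, in the spirit of the energy-level aspects Rastrow describes in [0046]. All thresholds, frame sizes, and names are illustrative assumptions, not disclosures of any cited reference.

```python
# Illustrative sketch only: frame-energy VAD reporting both a presence
# flag and a quantized strength level. Thresholds are hypothetical.
import numpy as np

def vad_with_level(samples: np.ndarray, frame_len: int = 320,
                   speech_thresh: float = 1e-3,
                   level_edges=(1e-3, 1e-2, 1e-1)):
    """Return (is_speech, level) per frame; level 0 = silence, 3 = loud."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)  # mean power
    is_speech = energy > speech_thresh                       # presence flag
    level = np.digitize(energy, level_edges)                 # strength bucket
    return is_speech, level

# Usage: one second of low-level noise plus a louder "voiced" burst.
rng = np.random.default_rng(0)
audio = rng.normal(0, 0.01, 16000)
audio[8000:12000] += rng.normal(0, 0.3, 4000)
speech, levels = vad_with_level(audio)
print(speech.any(), levels.max())
```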
Regarding claim 3, the rejection of claim 1 is incorporated. Srivastava, in view of Venkataraman and Rastrow, discloses all of the elements of the current invention as stated above. Venkataraman further teaches wherein the emotion detection circuit comprises: a Mel-scale frequency cepstral coefficients (MFCC) feature extraction circuit, configured to receive the clean microphone input to generate MFCC features (Venkataraman, Col. 15, lines 23-28: "acoustic features 412 (e.g., including Mel Frequency Cepstral Coefficients, Zero Crossing Rate, and/or Spectral Features) may be generated from audio 406 gathered between a customer and agent"); an artificial intelligence (AI) model, configured to receive the MFCC features to generate corresponding emotion; and a determination circuit, configured to generate a user emotion detection result according to the emotion determined by the AI model (Venkataraman, Col. 16, lines 40-46: "Using a trained machine learning model, the multimodal features, text converted from audio, and/or the context of the audio, may be utilized to determine an emotion, such as anger 516, disgust 518, fear 520, happiness 522, a neutral expression 524, sadness 526, surprise 528, and/or other emotions that a user (e.g., customer and/or agent) may experience."). The same motivation for claim 1 applies equally to claim 3.
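As context for the MFCC-based pipeline recited in claim 3, the following is a minimal sketch of MFCC feature extraction feeding a trained classifier. The librosa and scikit-learn libraries are assumed, the SVM merely stands in for the claimed "AI model," and the placeholder training data is hypothetical, not drawn from Venkataraman.

```python
# Illustrative sketch only (assumptions: librosa, scikit-learn). Nothing
# here is drawn from Venkataraman beyond the MFCC-into-trained-model flow.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(clean_audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    # 13 Mel-frequency cepstral coefficients, averaged over time frames.
    mfcc = librosa.feature.mfcc(y=clean_audio, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)  # one fixed-length feature vector per clip

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral",
            "sadness", "surprise"]  # labels per Venkataraman, Col. 16

# A trained model is assumed; an SVM on placeholder data stands in here.
model = SVC()
X_train = np.random.rand(14, 13)     # placeholder training features
y_train = np.tile(np.arange(7), 2)   # placeholder emotion labels
model.fit(X_train, y_train)

clip = np.random.rand(16000).astype(np.float32)  # stand-in clean input
print(EMOTIONS[int(model.predict([mfcc_features(clip)])[0])])
```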
Regarding claim 6, the rejection of claim 1 is incorporated. Srivastava, in view of Venkataraman and Rastrow, discloses all of the elements of the current invention as stated above. Srivastava further discloses wherein the user hotspot detection result comprises information of the user emotion and corresponding timing of audio/video content (Srivastava, [0029]: "the control circuitry may be further configured to normalize the received plurality of different types of input signals. Such normalization may be done based on the determined peak emotional response level for each emotional response by each user at the playback of the media item. The received plurality of different types of input signals may be normalized further based on a geographical region of the audience. The control circuitry may further synchronize and overlay the normalized plurality of different input signals in a timeline, which may be same as a playback timeline of the media item.").
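As context for claim 6's pairing of user emotion with content timing, the following hypothetical structure sketches how a detection result could record the emotion together with the playback timestamp, echoing Srivastava's synchronization of response signals to the media playback timeline ([0029]). All names are illustrative.

```python
# Illustrative sketch only: a hotspot record pairing a detected emotion
# with the playback time at which it occurred. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class HotspotEntry:
    timestamp_s: float   # position in the A/V content's playback timeline
    emotion: str         # user emotion detected at that moment
    vad_level: int       # quantized voice-strength level

log = [HotspotEntry(12.5, "happiness", 3), HotspotEntry(47.0, "surprise", 2)]
hotspots = [e for e in log if e.vad_level >= 2]  # keep strong reactions only
print(hotspots)
```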
Regarding claim 7, the rejection of claim 1 is incorporated. Srivastava, in view of Venkataraman and Rastrow, discloses all of the elements of the current invention as stated above. Venkataraman further teaches an audio/video content recognition circuit, configured to recognize audio content corresponding to the audio data to generate an audio/video content recognition result; wherein the output circuit further stores the audio/video content recognition result (Venkataraman, Col. 16, lines 35-39: "If the multimodal input includes audio, the audio may be transmitted to ASR and embedding 514. The ASR and embedding 514 may convert the audio to text and, using a trained machine learning model, generate context vectors from the audio and resulting text."). The same motivation for claim 1 applies equally to claim 7.
Regarding claim 8, the rejection of claim 7 is incorporated. Srivastava, in view of Venkataraman and Rastrow, discloses all of the elements of the current invention as stated above. Venkataraman further teaches wherein the audio/video content recognition circuit comprises: a MFCC feature extraction circuit, configured to receive the audio content to generate MFCC features (Venkataraman, Col. 15, lines 23-28: "acoustic features 412 (e.g., including Mel Frequency Cepstral Coefficients, Zero Crossing Rate, and/or Spectral Features) may be generated from audio 406 gathered between a customer and agent"); an AI model, configured to receive the MFCC features to determine corresponding content; and a determination circuit, configured to generate an audio/video content recognition result according to the content determined by the AI model (Venkataraman, Col. 16, lines 40-46: "Using a trained machine learning model, the multimodal features, text converted from audio, and/or the context of the audio, may be utilized to determine an emotion, such as anger 516, disgust 518, fear 520, happiness 522, a neutral expression 524, sadness 526, surprise 528, and/or other emotions that a user (e.g., customer and/or agent) may experience."). The same motivation for claim 1 applies equally to claim 8.
Regarding claim 9, Srivastava discloses a processing method of an electronic device, comprising: generating audio data and video data to a speaker and a display panel, respectively (Srivastava, [0037]: "The media device 110 may or may not include a display screen or a projection means, an audio device, and a set of input/output (I/O) devices. The media device 110 may be placed in a closed environment such that the playback of the set of media items through the media device 110 may lie in a field of audio-visual (FOA-V) reception of the audience 112. The media device 110 may comprise at least a first speaker to output an audio signal of the media item 104."); receiving a microphone input from a microphone of the electronic device; detecting the microphone input to generate a user hotspot detection result when the speaker plays the audio data and the display panel shows the video data (Srivastava, [0039]: "The plurality of different types of sensors 114 may comprise suitable logic, circuitry, and interface that may be configured to capture a plurality of different types of input signals from an audience (e.g., the test audience 116 or the audience 112), at the playback of a media item (e.g., the media item 104). The captured plurality of different types of input signals may correspond to emotional response data associated with the audience 112 at the playback of the media item 104. The plurality of different types of sensors 114 may include the set of audio sensors 114A, the set of image sensors 114B, and the set of biometric sensors 114C."; [0040]: "The set of audio sensors 114A may be a set of microphones"); and storing the user hotspot detection result (Srivastava, [0050]: "With the initialization of the playback of the media item 104 at the media device 110, the plurality of different types of sensors 114 may be activated to capture a set of emotional responses continuously at a playback duration of the media item 104."); and generating the user hotspot detection result according to the user emotion detection result and the VAD result (Srivastava, [0030]: "the control circuitry may be further configured to generate an amalgamated audience response signal for the media item. Such amalgamated audience response signals may be generated based on the synchronized plurality of different input signals and a plurality of weights assigned to the plurality of different types of sensors. The control circuitry may further identify a set of common positive peaks and a set of common negative peaks in each of the plurality of different types of input signals based on the overlay of the plurality of different types of input signals. A plurality of highlight points and a plurality of lowlight points may be further calculated by the control circuitry for a plurality of scenes of the media item, based on the identified set of common positive peaks and the set of common negative peaks."). 
However, Srivastava fails to expressly recite wherein the step of detecting the microphone input to generate the user hotspot detection result when the speaker plays the audio data and the display panel shows the video data comprises: cancelling or reducing an echo or the environment noise to generate a clean microphone input; generating a user emotion detection result indicating which user emotion the clean microphone input corresponds to; and detecting if the clean microphone input comprises human voice or human speech, and detecting strength of the human voice or the human speech, to generate a VAD result, wherein the VAD result comprises information indicating at which level the strength of the human voice or the human speech is located.
Venkataraman teaches wherein the step of detecting the microphone input to generate the user hotspot detection result when the speaker plays the audio data and the display panel shows the video data comprises: cancelling or reducing an echo or the environment noise to generate a clean microphone input (Venkataraman, Col. 16, lines 17-26: "Once each input has been received, for example by an emotion prediction system, the input may be processed by multimodal pre-processing 510. Such pre-processing may, depending on the type of input, include performing noise reduction in images or audio, correcting misspelled words (while preserving the misspellings for emotion prediction purposes), correcting grammatical errors (while preserving the grammatical errors for emotion prediction purposes), determining an endpoint of speech, and/or reducing background sounds in audio."); generating a user emotion detection result indicating which user emotion the clean microphone input corresponds to (Venkataraman, Col. 16, lines 40-46: "Using a trained machine learning model, the multimodal features, text converted from audio, and/or the context of the audio, may be utilized to determine an emotion, such as anger 516, disgust 518, fear 520, happiness 522, a neutral expression 524, sadness 526, surprise 528, and/or other emotions that a user (e.g., customer and/or agent) may experience.").
Srivastava and Venkataraman are analogous arts because they both belong to the same field of emotion detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the audience response capture system of Srivastava to incorporate the teachings of Venkataraman to clean an input and detect emotion from the input using Mel-scale frequency cepstral coefficient features. This allows a multimodal input to be used when detecting user emotions (Venkataraman, Col. 1, Brief Summary), enabling the system to improve user experiences by more effectively identifying and responding to emotions. However, Srivastava, in view of Venkataraman, fails to expressly recite detecting if the clean microphone input comprises human voice or human speech, and detecting strength of the human voice or the human speech, to generate a VAD result, wherein the VAD result comprises information indicating at which level the strength of the human voice or the human speech is located.
Rastrow teaches detecting if the clean microphone input comprises human voice or human speech, and detecting strength of the human voice or the human speech, to generate a VAD result, wherein the VAD result comprises information indicating at which level the strength of the human voice or the human speech is located (Rastrow, [0198]: “The interrupt detector 1010 and/or the device directed classifier 1020 may determine whether device-directed speech is detected using at least one classifier. For example, the classifier may consider a variety of information, including a volume level of the speech (e.g., whether the speech is louder or quieter than normal speech), speaker identification (e.g., identification data) corresponding to the speech, which may be determined using voice recognition, image data, and/or other techniques known to one of skill in the art, emotion detection (e.g., emotion data) corresponding to the speech (e.g., whether the speech is animated or quiet, for example), a length of time between when the output audio begins and when the speech is detected, and/or the like without departing from the disclosure. The classifier may process the input information and generate model output data, which may indicate a likelihood that the speech is directed to the device.”; [0046]: “Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in audio data based on various quantitative aspects of the audio data, such as the spectral slope between one or more frames of the audio data; the energy levels of the audio data in one or more spectral bands; the signal-to-noise ratios of the audio data in one or more spectral bands; or other quantitative aspects.”).
Srivastava, Venkataraman, and Rastrow are analogous arts because they each belong to the same field of audio processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the audience response capture system of Srivastava, as modified by the emotion prediction system of Venkataraman, to incorporate the teachings of Rastrow to detect the strength of a human voice in the audio input. Detecting strength of a human voice allows the system to differentiate between voices the system is intended to pick up and voices that may be additional background noise (Rastrow, [0198]). This ensures that the system can determine if a human voice constitutes useful data, which improves the accuracy of the system.
Regarding claim 13, the rejection of claim 9 is incorporated. Srivastava, in view of Venkataraman and Rastrow, discloses all of the elements of the current invention as stated above. Srivastava further discloses wherein the user hotspot detection result comprises information of the user emotion and corresponding timing of audio/video content (Srivastava, [0029]: "the control circuitry may be further configured to normalize the received plurality of different types of input signals. Such normalization may be done based on the determined peak emotional response level for each emotional response by each user at the playback of the media item. The received plurality of different types of input signals may be normalized further based on a geographical region of the audience. The control circuitry may further synchronize and overlay the normalized plurality of different input signals in a timeline, which may be same as a playback timeline of the media item.").
Regarding claim 14, the rejection of claim 9 is incorporated. Srivastava, in view of Venkataraman and Rastrow, discloses all of the elements of the current invention as stated above. Venkataraman further teaches recognizing audio content corresponding to the audio data to generate an audio/video content recognition result; and storing the audio/video content recognition result (Venkataraman, Col. 16, lines 35-39: "If the multimodal input includes audio, the audio may be transmitted to ASR and embedding 514. The ASR and embedding 514 may convert the audio to text and, using a trained machine learning model, generate context vectors from the audio and resulting text."). The same motivation for claim 9 applies equally to claim 14.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TYLER J BECKER whose telephone number is (703) 756-1271. The examiner can normally be reached M-Th, 7:15am-5:45pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TYLER BECKER/Examiner, Art Unit 2657
/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657