DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 5, 7, 10, 12-14, 16, 18, 21 and 23 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by the publication titled “Error Handling in Multimodal Voice-Enabled Interfaces of Tour-Guide Robots Using Graphical Models” by Prodanov.
With regard to claim 1, Prodanov discloses a method for vision-assisted audio processing in a far field device, comprising:
receiving a video stream (Fig. 8.4(c) on page 122);
detecting a person in the video stream (See table 7.3 on page 105 and pages 122-123, a robot equipped with a video camera records video for detecting a user presence through face detection);
determining the person is an attentive person based on an attention feature associated with the person, wherein the attention feature indicates the person is paying attention to the far field device (page 123, section 8.3.4, Determination is made that the person is attending the conversation by detecting the person’s face for a preset number of frames or minimum period of time);
applying, in response to determining the person being the attentive person, beamforming to a microphone array of the far field device to enhance reception of audio signals received from a target direction of arrival corresponding to a target direction in which the person is located (page 25, speech enhancement and audio signal capture section discusses performing beamforming for precise audio spatial filtering. Pages 140-142 disclose the specific microphone array used to perform beamforming to target the person speaking); and
initiating, in response to determining the person being the attentive person, automatic speech recognition on the audio signals received from the target direction of arrival (page 123, speech recognition section 8.3.5, and sections 3.1 and 5.4.2. See also table 7.3 and pages 122-123, Speech recognition is performed in response to the recognized face and the detected audio).
With regard to claim 2, Prodanov discloses the method of claim 1, wherein applying beamforming to the microphone array of the far field device includes at least one of amplifying the audio signals coming from the target direction of arrival or nullifying other audio signals coming from other directions different from the target direction of arrival (the Speech enhancement and audio signal capture section on page 25 describes microphone arrays implemented to reduce noise and thereby relatively amplify the speech portion of the audio signal. The beamformer specification is shown on pages 140-142. The DSDA illustration in Fig. 1 shows an amplified speech signal with the noise removed. The microphone array also seeks to minimize audio signals arriving from outside the directional sensitivity beam shown in Fig. 2).
With regard to claim 3, Prodanov discloses the method of claim 1, further comprising:
receiving one or more audio signals having one or more frequencies; and wherein applying beamforming to the microphone array of the far field device includes applying different weights to different ones of the one or more frequencies to perform at least one of amplifying the audio signals coming from the target direction of arrival or nullifying other audio signals coming from other directions different from the target direction of arrival (Section 3.1.3 on page 26 describes the weighting of specific frequencies of the audio signal in order to accent the speech audio signal. See also the Speech enhancement and audio signal capture section on page 25, which describes microphone arrays implemented to reduce noise and thereby relatively amplify the speech portion of the audio signal. The beamformer specification is shown on pages 140-142. The DSDA illustration in Fig. 1 shows an amplified speech signal with the noise removed. The microphone array also seeks to minimize audio signals arriving from outside the directional sensitivity beam shown in Fig. 2).
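For illustration only (not part of the examined record), per-frequency weighting in a delay-and-sum beamformer can be sketched as follows. This is a generic textbook formulation, not Prodanov's specific beamformer; the uniform linear array geometry, spacing, and constants are assumptions:

```python
import cmath
import math

def steering_weights(num_mics, spacing_m, freq_hz, angle_rad, c=343.0):
    """Per-frequency weights for a uniform linear array: each weight is the
    conjugate of the plane-wave phase delay at that microphone, so energy
    from angle_rad adds coherently while other directions partially cancel."""
    delays = [m * spacing_m * math.sin(angle_rad) / c for m in range(num_mics)]
    return [cmath.exp(2j * math.pi * freq_hz * d) for d in delays]

def beamform_bin(mic_bins, weights):
    """Weighted sum of one STFT frequency bin across all microphones."""
    return sum(w * x for w, x in zip(weights, mic_bins)) / len(mic_bins)
```

Because the weights depend on `freq_hz`, a wideband signal receives a different weight vector in every frequency bin, which illustrates the frequency-dependent weighting the claim recites.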
With regard to claim 5, Prodanov discloses the method of claim 1, wherein determining the person is the attentive person further comprises:
detecting the attention feature associated with the person (page 123, section 8.3.4, Determination is made that the person is attending the conversation by detecting the person’s frontal face for a preset number of frames or minimum period of time);
comparing a period of time that the attention feature has been detected against a threshold (page 123, section 8.3.4, The example given is 0.8 seconds during which a forward-facing face is detected); and
identifying the person as the attentive person in response to determining that the period of time exceeds the threshold (page 123, section 8.3.4, Determination is made that the person is attending the conversation by detecting the person’s frontal face for a preset number of frames or minimum period of time).
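The three mapped steps (detect the feature, time it against a threshold, then classify the person) can be sketched as follows. This is a hypothetical illustration using the 0.8-second figure cited above; the function name, frame rate, and per-frame boolean representation are assumptions:

```python
def classify_attentive(face_flags, fps=25.0, threshold_sec=0.8):
    """Return True once the attention feature (e.g. a frontal face) has been
    detected continuously for at least threshold_sec of video."""
    consecutive = 0
    for seen in face_flags:                # one boolean per video frame
        consecutive = consecutive + 1 if seen else 0
        if consecutive / fps >= threshold_sec:
            return True                    # detection period met the threshold
    return False
```

At 25 fps, 20 consecutive face detections span 0.8 seconds and trip the gate; any gap resets the count.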
With regard to claim 7, Prodanov discloses the method of claim 1, wherein the attention feature comprises at least one of a frontal face of the person, a side face of the person, an eye gaze of the person, a facial expression of the person, or a mouth movement of the person (page 123, section 8.3.4, Determination is made that the person is attending the conversation by detecting the person’s frontal face for a preset number of frames or minimum period of time).
With regard to claim 10, Prodanov discloses the method of claim 1, wherein determining the person is the attentive person further comprises detecting the attention feature associated with the person, comparing a period of time that the attention feature has been detected against a threshold, and identifying the person as the attentive person in response to determining that the period of time exceeds the threshold (page 123, section 8.3.4, Determination is made that the person is attending the conversation by detecting the person’s frontal face for a preset number of frames or minimum period of time. The example given is 0.8 seconds during which a forward-facing face is detected); and
wherein applying beamforming to the microphone array of the far field device includes at least one of amplifying the audio signals coming from the target direction of arrival or nullifying other audio signals coming from other directions different from the target direction of arrival (the Speech enhancement and audio signal capture section on page 25 describes microphone arrays implemented to reduce noise and thereby relatively amplify the speech portion of the audio signal. The beamformer specification is shown on pages 140-142. The DSDA illustration in Fig. 1 shows an amplified speech signal with the noise removed. The microphone array also seeks to minimize audio signals arriving from outside the directional sensitivity beam shown in Fig. 2).
With regard to claim 12, the discussion of claim 1 applies. Prodanov discloses an apparatus for vision-assisted audio processing in a far field device, comprising: one or more memories; and one or more processors coupled with the one or more memories for performing the method recited in claim 1 (See page 65, section 5.3.1 Hardware architecture. The apparatus includes processors with memory for processing video and audio input).
With regard to claims 13-14, 16, 18 and 21, the discussions of claims 2-3, 5, 7 and 10 apply respectively.
With regard to claim 23, the discussions of claims 1 and 12 apply. Prodanov discloses a software program for controlling the device and performing the method recited in claim 1 (See page 66, Section 5.3.2 Software architecture).
Allowable Subject Matter
Claims 4, 6, 8-9, 11, 15, 17, 19-20 and 22 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
With regard to claims 4 and 15, no prior art of record was found to teach the specific claimed steps of:
determining a first location of the person in an image coordinate system of the video stream in response to the person being the attentive person;
converting the first location into a second location of the person in an audio coordinate system of the microphone array; and
determining a target vector toward the second location, wherein the target direction of arrival corresponds to the target vector.
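For context on the claimed conversion, the image-to-audio coordinate mapping can be illustrated with a standard pinhole back-projection into the microphone-array frame. This generic sketch is not taken from the application or the prior art of record; the intrinsic parameters and the far-field simplification (camera/array offset neglected, rotation only) are assumptions:

```python
import math

def pixel_to_target_vector(u, v, fx, fy, cx, cy, rotation):
    """Back-project pixel (u, v) through a pinhole camera model to a ray in
    camera coordinates, rotate it into the microphone array's coordinate
    system, and normalize; the resulting unit vector is the target vector
    whose direction corresponds to the target direction of arrival."""
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)          # image -> camera ray
    rotated = [sum(rotation[i][j] * ray[j] for j in range(3))
               for i in range(3)]                       # camera -> array frame
    norm = math.sqrt(sum(c * c for c in rotated))
    return tuple(c / norm for c in rotated)
```

With an identity rotation (camera and array axes aligned), the principal-point pixel maps to the straight-ahead unit vector.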
With regard to claims 6 and 17, no prior art of record was found to teach or fairly suggest the specific steps of:
identifying the attention feature associated with the person in a first video frame of a plurality of video frames of the video stream;
skipping a number of video frames subsequent to the first video frame; and
identifying the attention feature associated with the person in a second video frame of the plurality of video frames of the video stream, wherein the second video frame is after the number of video frames subsequent to the first video frame, wherein a time duration between the first video frame and the second video frame comprises the period of time exceeding the threshold.
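The frame-skipping check recited above can be sketched as a hypothetical illustration; the function name, skip count, and frame rate are assumptions, and the skipped frames are deliberately never inspected:

```python
def attentive_with_skips(frames, skip=19, fps=25.0, threshold_sec=0.8):
    """Check the attention feature in a first frame and in a second frame
    that lies skip frames later, without checking the frames in between;
    if both show the feature and the elapsed time between them reaches
    threshold_sec, treat the person as attentive."""
    for first in range(len(frames) - skip - 1):
        second = first + skip + 1              # frame after the skipped run
        elapsed = (second - first) / fps       # time spanned by the skip
        if frames[first] and frames[second] and elapsed >= threshold_sec:
            return True
    return False
```

Note that only the endpoint frames matter: the feature may be absent in every skipped frame and the gate still trips, which is what distinguishes this from the per-frame check of claim 5.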
With regard to claims 8, 9, 11, 19, 20 and 22, no prior art of record was found to teach or fairly suggest the steps of:
detecting an interferer object in the video stream; and
identifying an interferer direction of arrival corresponding to an interferer direction in which the interferer object is located;
wherein applying beamforming to the microphone array of the far field device includes at least one of amplifying the audio signals coming from the target direction of arrival or nullifying interferer audio signals coming from the interferer direction of arrival.
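For illustration, steering unit gain toward the target while placing a spatial null toward the interferer can be sketched for a two-microphone array by solving the two linear constraints w·a(target) = 1 and w·a(interferer) = 0. This is a generic null-steering formulation, not taken from Prodanov or Olgiati; the geometry and constants are assumptions:

```python
import cmath
import math

def manifold(angle_rad, freq_hz, spacing_m=0.05, c=343.0):
    """Two-mic array response to a plane wave from angle_rad (one STFT bin)."""
    phase = 2.0 * math.pi * freq_hz * spacing_m * math.sin(angle_rad) / c
    return (1.0, cmath.exp(-1j * phase))

def null_steering_weights(target_rad, interferer_rad, freq_hz):
    """Solve w . a(target) = 1 and w . a(interferer) = 0 in closed form:
    unit gain on the target DOA, a null on the interferer DOA."""
    at = manifold(target_rad, freq_hz)
    ai = manifold(interferer_rad, freq_hz)
    w2 = 1.0 / (at[1] - ai[1])     # assumes the two DOAs are distinct
    w1 = -w2 * ai[1]
    return (w1, w2)

def response(weights, angle_rad, freq_hz):
    """Array gain toward angle_rad under the given weights."""
    a = manifold(angle_rad, freq_hz)
    return weights[0] * a[0] + weights[1] * a[1]
```

With two microphones the two constraints use up all the degrees of freedom; nulling additional interferers would require more microphones.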
U.S. Patent Application Publication No. 2019/0050629 to Olgiati discloses a system for determining an interferer in the form of a person or object occluding or obscuring another person or object (See Fig. 3 and paragraphs [0038]-[0041]). However, Olgiati does not teach or fairly suggest that the interferer or overlapping detected persons are used for processing audio or for microphone array beamforming. Prodanov also does not teach or suggest the determination of, or accounting for, interferers when processing audio data in microphone array beamforming.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WESLEY J TUCKER whose telephone number is (571) 272-7427. The examiner can normally be reached 9 AM to 5 PM, Monday through Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JOHN VILLECCO can be reached at 571-272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WESLEY J TUCKER/Primary Examiner, Art Unit 2661