DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on March 3, 2026 has been entered.
Response to Arguments
Applicant's arguments and amendments filed with the Request for Continued Examination on March 3, 2026 (herein “Amendment”) in response to the rejection of claims 1–3 and 5–10 have been fully considered, but they are not persuasive. First, on page 9, Applicant sets forth that the amended limitations are directed towards aspects of the invention of “hand detection in concert with all the other signals for identifying a sound signal containing a voice component coming from a specific region.” To this end, the independent claims recite in relevant part [the processing circuit is] “arranged to determine whether a speaker changes for determining whether to select another region from the multiple regions as the specific region according to the voice characteristic value of the main sound and a subsequent hand gesture detection result from the other region.” Therefore, to the extent they are claimed, the “all the other signals” are simply “the voice characteristic value of the main sound” and “a subsequent hand gesture.” Further, Applicant argues on page 12 that Gutta “does not teach that a recognition result … is used in conjunction with a voice characteristic and a gesture detection result to identify a current speaker.”
However, the primary reference Gutta teaches in col. 9, ll. 49–67, col. 10, ll. 62–67, and col. 7, l. 42–col. 8, l. 65, that the visual cue identifier and the audio cue identifier provide cues to the predictive speaker identifier, each speaker corresponding to one of multiple speaker locations in a room (multiple regions, including another region different from the specific region), and that the predictive speaker identifier uses the cues (both types, visual and audio) via a decision tree to predict the next speaker according to rules considering face detection, hand gesture, and a grunt or clearing of the throat. Further, Gutta in col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker via main sound from sound detection. Still further, to the extent Gutta uses “or” language in discussing “to detect any event having associated acoustic or visual cues exhibited by a person” (such as in col. 11, ll. 22–39), Gutta discloses that both visual and acoustic cues are employed, as shown in the table in col. 11, ll. 29–39, in which a set of cues including both visual and acoustic cues is evaluated for detecting an event, where the events include a change in speakers as disclosed earlier in Gutta. Therefore, cited primary reference Gutta teaches the claimed “arranged to determine whether a speaker changes for determining whether to select another region from the multiple regions as the specific region according to the voice characteristic value of the main sound and a subsequent hand gesture detection result from the other region,” as indicated in the updated rejection rationale below.
Further, on pages 9–11 of the Amendment, Applicant sets forth that “gesture lock means that another detection will not be performed immediately after identifying a current speaker, such that highlighting of a specific region will not constantly switch,” and cites for support ¶¶ 38–39 and other portions of the Specification reciting “Step 314 may be performed after a relatively long period of detection.” However, neither the cited paragraphs of the Specification nor the claims recite that “another detection will not be performed immediately.” To be sure, the broad disclosure of the Specification, including the relative term of degree “after a relatively long period of detection,” does not clearly teach that “another detection will not be performed immediately,” considering that “relatively long period” is relative and the Specification fails to state with respect to what the “period of detection” is considered relatively long. At a minimum, a person having ordinary skill in the art (herein “PHOSITA”) would understand that computer system latencies would be present in such an automated voice/image sensing system and would be part of that “period of detection.” Notwithstanding, although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
On page 12, and again on pages 13–14, Applicant further argues regarding secondary reference Kim that Kim does not teach its gesture lock to include “receiving gesture lock results when continuing to perform a particular operation.” However, Kim is only relied upon for teaching a “gesture lock,” and Kim’s teachings are not viewed alone, but in combination with the teachings of Gutta according to the motivation of record. In response to Applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
Finally, on page 14, Applicant argues that “If Gutta were modified to include the gesture locking as taught by Kim, this quick processing will be prevented such that the current speaker will continue to be highlighted until a next person has started speaking. The modification of Gutta would therefore teach against the stated objectives of the prior art.” However, in challenging this combination of Gutta and Kim, Applicant asserts that modifying Gutta by Kim would necessarily mean that “quick processing will be prevented,” without citing to portions of Gutta or Kim to support this proposition. Moreover, a modification of Gutta to include a gesture lock would leave open the duration of that gesture lock, such that the stated objectives of Gutta remain operational and intact. In any event, the stated objective of Gutta is to predict events (such as a change in speakers) using acoustic and visual cues (see Gutta Abstract), which would not be rendered unsatisfactory simply by adding a gesture lock.
Therefore, in view of the above, while all of Applicant's arguments and amendments have been fully considered, they are not found persuasive, and the rejection of all pending claims over the combinations of the currently cited art is maintained, with updates to the rejection rationale below where appropriate.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 5–6 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claims 5 and 6 both recite the limitations "the SoC further comprises: a voice activity detection circuit" and “a voice component according to the multiple sound signals”; however, a “voice activity detection circuit” and a “voice component” are earlier introduced in claim 2, from which claims 5 and 6 ultimately depend through their dependence upon claim 3. Therefore, it is unclear and indefinite whether the voice activity detection circuit and voice component recited in claims 5 and 6 are additional elements or are the same elements earlier introduced in claim 2.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 7–8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Gutta et al., US Patent No. US 6,894,714 B2 (herein “Gutta”) in view of Bytyqi, US Patent No. 10,440,327 B1 (herein “Bytyqi”) and further in view of Kim et al., US Patent Application Publication No. US 2015/0182160 A1 (herein “Kim”).
Regarding claim 1, with deficiencies of Gutta noted in square brackets [], Gutta teaches a system [on chip (SoC)], arranged to perform partial highlighting of a current speaker with aid of hand gesture detection, comprising (Gutta col. 3, ll. 17–32, and col. 2, ll. 43–54, video processing system 10 processing audio and video information to detect cues including visual cues such as a hand gesture to detect an event exhibited by a person in a conference and focus a camera and obtain images of the predicted speaker, where col. 4, l. 59 – col. 5, l. 30 teach detection and tracking operations upon an object of interest as a current speaker):
a person recognition circuit, arranged to obtain an image data from an image capturing device (Gutta col. 3, ll. 33–43 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected to a system bus that is also connected to pan-tilt-zoom PTZ camera 18), and perform person recognition upon the image data to generate a recognition result (Gutta col. 4, ll. 51–58, PTZ camera generates an image including an object of interest such as videoconference participant 22-k and other objects such as videoconference participant 22-k+1, the image supplied as video input to the detection and tracking operation which detects and tracks the object of interest (participant – person) using well-known conventional detection and tracking techniques, including, as disclosed in col. 5, l. 56–col. 6, l. 21, face recognition and assigning a unique identifier to each face detected), wherein the recognition result comprises multiple regions, and each of the multiple regions comprises a person (Gutta col. 3, ll. 45–56, and col. 5, ll. 1–16, PTZ camera and video conferencing system accommodates conference participants 22-1 through 22-N (for example three people as shown in fig. 2 arranged in three different regions of the room), and will track one of the participants of interest by zooming in on their area);
a hand gesture detection circuit, arranged to obtain the image data from the image capturing device (Gutta col. 3, ll. 33–43 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected to a system bus that is also connected to pan-tilt-zoom PTZ camera 18), and perform hand gesture detection upon a hand gesture image data in the image data, to generate a hand gesture detection result (Gutta col. 7, ll. 12–16 and 45–51, video image from PTZ camera provided to predictive speaker identifier which processes the video to identify predefined visual cues including hand gestures, where col. 10, ll. 45–52 teaches that the hand gestures can be obtained using techniques described in a cited academic paper), wherein the hand gesture detection result indicates that a predetermined hand gesture is detected (Gutta col. 10, ll. 46–53, col. 9, l. 49–col. 10, l. 10, visual cue identifier identifying one or more predefined visual cues including hand gestures such as a predefined “raised hand or finger” gesture);
a sound detection circuit, arranged to receive multiple sound signals from multiple microphones (Gutta col. 3, ll. 33–56 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected to a system bus that is also connected to an array of microphones 16 for capturing audio information), and determine a voice characteristic value of a main sound (Gutta col. 6, ll. 25–43, and col. 10, l. 62–col. 11, l. 20, time-stamped signals generated by the microphone array are passed to a speaker identification module and an audio locator which obtains directional information identifying the pan and tilt angles associated with the sound source, where such characteristic sounds identifying a predicted speaker include a grunt or clearing of the throat, or particular phrases indicating the interruption of a current speaker, and also acoustic and prosodic features); and
a processing circuit, coupled to the person recognition circuit, the hand gesture detection circuit, and the sound detection circuit (Gutta col. 3, ll. 33–56 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected via a bus to the other system components), and arranged to determine a specific region in the image data corresponding to a current speaker according to the recognition result, the gesture detection result, and the voice characteristic value of the main sound (Gutta col. 5, ll. 25–32, optical zooming operation zooms in (highlight) and centers on the detected object of interest (specific region), where col. 7, ll. 25–41, in discussing fig. 3, teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker using the audio and video information as described above, and where col. 5, l. 35, also discussing fig. 3, teaches that the adaptive position locator processing the audio and video determines the location of a speaker (current speaker)), and process the image data to highlight the specific region (Gutta col. 7, ll. 3–11, a bounding box is computed and used to focus the PTZ camera, where col. 5, ll. 5–30 teaches the zooming is on the detected object of interest such as a current speaker), wherein the processing circuit selects a region from the multiple regions as the specific region according to the voice characteristic value of the main sound and the hand gesture detection result (Gutta col. 4, l. 50–col. 5, l. 4, detection and tracking of an object of interest which in turn controls the zooming of the PTZ camera (thus selecting one of the conference participant regions to focus on) using audio location (voice characteristic value of the main sound) and motion detection such as gesturing, where col. 10, ll. 46–53 teaches the gesturing as hand gestures), and the processing circuit is further arranged to enable [a gesture lock] for the specific region for indicating to keep highlighting the specific region (Gutta col. 6, l. 44–col. 7, l. 10, audio and video signals are accumulated for a defined interval so as to be able to collect data for meaningful events, where motion heuristics are used to identify only frame portions having a significant amount of motion so that false events that don’t indicate a speaker change are suppressed, and the camera is kept focused (highlighting) on the current speaker (the specific region)), and arranged to determine whether a speaker changes for determining whether to select another region from the multiple regions as the specific region (Gutta col. 7, ll. 10–40, the focus of the PTZ camera on one of the conference participants, which selects another region (the region of the next speaker) as the specific region, is determined by the audio cues and video cues determined from processing audio signals generated by the array and video image from the PTZ camera) according to the voice characteristic value of the main sound and a subsequent hand gesture detection result from the other region (Gutta col. 9, ll. 49–67, col. 10, ll. 62–67, and col. 7, l. 42–col. 8, l. 65, disclosing that the visual cue identifier and audio cue identifier provide cues to the predictive speaker identifier, each speaker corresponding to one of multiple speaker locations in a room (multiple regions, including another region different from the specific region), the predictive speaker identifier using the cues via a decision tree to predict the next speaker according to rules considering face detection, hand gesture and a grunt or clearing of the throat, where col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker via main sound from sound detection).
While Gutta teaches that elements of system 10 may be implemented as a circuit card, Gutta does not explicitly teach the system to be a system on a chip (SoC). Bytyqi teaches a system on a chip (SoC) (Bytyqi col. 5, ll. 13–19, video communication device including a processor that is a high-power system on a chip (SoC)).
Still further, Gutta does not, but Kim teaches, a gesture lock (Kim ¶¶196–197, a gesture locking state is set in which any future gestures are not input into the electronic device for detection), and disables the gesture lock (Kim ¶196, gesture unlocking can be requested so that the device is in a gesture process standby state that will process gesture inputs).
Therefore, taking the teachings of Gutta and Bytyqi together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to be integrated as a system on a chip as disclosed in Bytyqi, at least because doing so would provide for video conferencing processing in a single standalone device, offering a seamless video-conferencing experience without requiring the user to manage multiple large, cumbersome, and difficult-to-assemble devices to have a videoconferencing system. See Bytyqi col. 1, ll. 23–28 and 32–41, and Abstract.
Further, taking the teachings of Gutta as modified above by Bytyqi, and Kim together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the gesture locking and unlocking teachings as disclosed and cited above in Kim at least because doing so would allow for greater control over electronic devices by providing a meaningful service based on a result inferred from context information instead of simply monitoring one sensor input. See Kim ¶¶2, 57.
Regarding claim 7, Gutta teaches wherein the processing circuit processes the image data to magnify a person within the specific region (Gutta col. 5, ll. 1–31, PTZ camera tracks one of the participants of interest by zooming in on their face, where fig. 2 illustrates the zooming operation and magnification effects).
Regarding claim 8, Gutta teaches wherein the voice characteristic value of the main sound is a voiceprint or an azimuth of the main sound (Gutta col. 6, ll. 25–43, and col. 7, ll. 3–9, audio locator used to determine the position of a particular conference participant, and identify the pan and tilt angles (azimuth) and directional information to determine the bounding box for the PTZ camera to zoom in on).
Regarding claim 10, with deficiencies of Gutta noted in square brackets [], Gutta teaches a video processing method, arranged to perform partial highlighting of a current speaker with aid of hand gesture detection, comprising (Gutta col. 3, ll. 17–32, and col. 2, ll. 43–54, video processing system 10 processing audio and video information to detect cues including visual cues such as a hand gesture to detect an event exhibited by a person in a conference and focus a camera and obtain images of the predicted speaker, where col. 4, l. 59 – col. 5, l. 30 teach detection and tracking operations upon an object of interest as a current speaker):
[disposing a plurality of devices on a System on Chip (SoC) to perform the following steps:]
obtaining an image data from an image capturing device (Gutta col. 3, ll. 33–43 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected to a system bus that is also connected to pan-tilt-zoom PTZ camera 18), and performing person recognition upon the image data to generate a recognition result (Gutta col. 4, ll. 51–58, PTZ camera generates an image including an object of interest such as videoconference participant 22-k and other objects such as videoconference participant 22-k+1, the image supplied as video input to the detection and tracking operation which detects and tracks the object of interest (participant – person) using well-known conventional detection and tracking techniques, including, as disclosed in col. 5, l. 56–col. 6, l. 21, face recognition and assigning a unique identifier to each face detected), wherein the recognition result comprises multiple regions, and each of the multiple regions comprises a person (Gutta col. 3, ll. 45–56, and col. 5, ll. 1–16, PTZ camera and video conferencing system accommodates conference participants 22-1 through 22-N (for example three people as shown in fig. 2 arranged in three different regions of the room), and will track one of the participants of interest by zooming in on their area);
obtaining the image data from the image capturing device (Gutta col. 3, ll. 33–43 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected to a system bus that is also connected to pan-tilt-zoom PTZ camera 18), and performing hand gesture detection upon a hand gesture image data in the image data, to generate a hand gesture detection result (Gutta col. 7, ll. 12–16 and 45–51, video image from PTZ camera provided to predictive speaker identifier which processes the video to identify predefined visual cues including hand gestures, where col. 10, ll. 45–52 teaches that the hand gestures can be obtained using techniques described in a cited academic paper), wherein the hand gesture detection result indicates that a predetermined hand gesture is detected (Gutta col. 10, ll. 46–53, col. 9, l. 49–col. 10, l. 10, visual cue identifier identifying one or more predefined visual cues including hand gestures such as a predefined “raised hand or finger” gesture);
receiving multiple sound signals from multiple microphones (Gutta col. 3, ll. 33–56 and col. 4, ll. 23–25, elements of system 10 may be implemented as a circuit card, where system 10 includes a processor connected to a system bus that is also connected to an array of microphones 16 for capturing audio information), and determining a voice characteristic value of a main sound (Gutta col. 6, ll. 25–43, and col. 10, l. 62–col. 11, l. 20, time-stamped signals generated by the microphone array are passed to a speaker identification module and an audio locator which obtains directional information identifying the pan and tilt angles associated with the sound source, where such characteristic sounds identifying a predicted speaker include a grunt or clearing of the throat, or particular phrases indicating the interruption of a current speaker, and also acoustic and prosodic features); and
determining a specific region corresponding to a current speaker in the image data according to the recognition result, the gesture detection result, and the voice characteristic value of the main sound (Gutta col. 5, ll. 25–32, optical zooming operation zooms in (highlight) and centers on the detected object of interest (specific region), where col. 7, ll. 25–41, in discussing fig. 3, teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker using the audio and video information as described above, and where col. 5, l. 35, also discussing fig. 3, teaches that the adaptive position locator processing the audio and video determines the location of a speaker (current speaker)), and processing the image data to highlight the specific region (Gutta col. 7, ll. 3–11, a bounding box is computed and used to focus the PTZ camera, where col. 5, ll. 5–30 teaches the zooming is on the detected object of interest such as a current speaker), comprising: selecting a region from the multiple regions as the specific region according to the voice characteristic value of the main sound and the hand gesture detection result (Gutta col. 4, l. 50–col. 5, l. 4, detection and tracking of an object of interest which in turn controls the zooming of the PTZ camera (thus selecting one of the conference participant regions to focus on) using audio location (voice characteristic value of the main sound) and motion detection such as gesturing, where col. 10, ll. 46–53 teaches the gesturing as hand gestures), enabling [a gesture lock] for the specific region for indicating to keep highlighting the specific region (Gutta col. 6, l. 44–col. 7, l. 10, audio and video signals are accumulated for a defined interval so as to be able to collect data for meaningful events, where motion heuristics are used to identify only frame portions having a significant amount of motion so that false events that don’t indicate a speaker change are suppressed, and the camera is kept focused (highlighting) on the current speaker (the specific region)); and determining whether a speaker changes for determining whether to select another region from the multiple regions as the specific region (Gutta col. 7, ll. 10–40, the focus of the PTZ camera on one of the conference participants, which selects another region (the region of the next speaker) as the specific region, is determined by the audio cues and video cues determined from processing audio signals generated by the array and video image from the PTZ camera) according to the voice characteristic value of the main sound and a subsequent hand gesture detection result from the other region (Gutta col. 9, ll. 49–67, col. 10, ll. 62–67, and col. 7, l. 42–col. 8, l. 65, disclosing that the visual cue identifier and audio cue identifier provide cues to the predictive speaker identifier, each speaker corresponding to one of multiple speaker locations in a room (multiple regions, including another region different from the specific region), the predictive speaker identifier using the cues via a decision tree to predict the next speaker according to rules considering face detection, hand gesture and a grunt or clearing of the throat, where col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker via main sound from sound detection).
While Gutta teaches that elements of system 10 may be implemented as a circuit card, Gutta does not explicitly teach, but Bytyqi teaches, disposing a plurality of devices on a System on Chip (SoC) to perform the following steps (Bytyqi col. 5, ll. 13–19, video communication device including a processor that is a high-power system on a chip (SoC)).
Still further, Gutta does not, but Kim teaches, a gesture lock (Kim ¶¶196–197, a gesture locking state is set in which any future gestures are not input into the electronic device for detection), and disables the gesture lock (Kim ¶196, gesture unlocking can be requested so that the device is in a gesture process standby state that will process gesture inputs).
Therefore, taking the teachings of Gutta and Bytyqi together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to be integrated as a system on a chip as disclosed in Bytyqi, at least because doing so would provide for video conferencing processing in a single standalone device, offering a seamless video-conferencing experience without requiring the user to manage multiple large, cumbersome, and difficult-to-assemble devices to have a videoconferencing system. See Bytyqi col. 1, ll. 23–28 and 32–41, and Abstract.
Further, taking the teachings of Gutta as modified above by Bytyqi, and Kim together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the gesture locking and unlocking teachings as disclosed and cited above in Kim at least because doing so would allow for greater control over electronic devices by providing a meaningful service based on a result inferred from context information instead of simply monitoring one sensor input. See Kim ¶¶2, 57.
Claims 2–3 and 5–6 are rejected under 35 U.S.C. 103 as being unpatentable over Gutta in view of Bytyqi and Kim, as set forth above regarding claim 1, and further in view of Veselinovic et al., US Patent Application Publication No. US 2021/0120335 A1 (herein “Veselinovic”).
Regarding claim 2, with deficiencies of Gutta noted with square brackets [], Gutta teaches further comprising: [a voice activity detection circuit, arranged to determine whether at least one part of the multiple sound signals comprises a voice component according to the multiple sound signals;] wherein according to whether the at least one part of the multiple sound signals comprises the voice component, the processing circuit determines the specific region in the image data according to the recognition result, the hand gesture detection result, and the voice characteristic value of the main sound (Gutta col. 5, ll. 25–32, optical zooming operation zooms in (highlight) and centers on the detected object of interest as a conference participant (specific region), where col. 7, l. 42–col. 8, l. 65, discloses that visual cue identifier and audio cue identifier provide cues to the predictive speaker identifier (thus the sound signal comprising a voice component of the conference participant) which uses the cues via a decision tree to predict the next speaker, and direct the PTZ camera, where col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker).
Gutta does not explicitly teach, but Veselinovic teaches a voice activity detection circuit, arranged to determine whether at least one part of the multiple sound signals comprises a voice component according to the multiple sound signals (Veselinovic ¶¶70, 76, voice activity detector determines whether there is voice present in an audio signal, where the VAD is part of a conference room system recording multiple human speakers).
Therefore, taking the teachings of Gutta as modified above by Bytyqi and Kim, and Veselinovic together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the voice activity detection teachings as disclosed and cited above in Veselinovic at least because doing so would provide for a higher quality sound capture and more optimal coverage of sound in a conference environment. See Veselinovic ¶¶7, 8, and 3.
Regarding claim 3, with deficiencies of Gutta noted with square brackets [], Gutta teaches wherein [when the voice activity detection circuit indicates that the at least one part of the multiple sound signals comprises the voice component], the processing circuit determines the specific region in the image data according to the recognition result, the gesture detection result, and the voice characteristic value of the main sound, and processes the image data to highlight the specific region (Gutta col. 5, ll. 25–32, optical zooming operation zooms in (highlight) and centers on the detected object of interest as a conference participant (specific region), where col. 7, l. 42–col. 8, l. 65, discloses that visual cue identifier and audio cue identifier provide cues to the predictive speaker identifier (thus the sound signal comprising a voice component of the conference participant) which uses the cues via a decision tree to predict the next speaker, and direct the PTZ camera, where col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker).
Gutta does not explicitly teach, but Veselinovic teaches when the voice activity detection circuit indicates that the at least one part of the multiple sound signals comprises the voice component (Veselinovic ¶¶61–63, in response to (when) a value of the voice activity detection information related to a new sound activity meets a cost function threshold, the new sound will be considered a relevant candidate location for which audio beamforming will be adjusted).
Therefore, taking the teachings of Gutta as modified above by Bytyqi and Kim, and Veselinovic together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the voice activity detection teachings as disclosed and cited above in Veselinovic at least because doing so would provide for a higher quality sound capture and more optimal coverage of sound in a conference environment. See Veselinovic ¶¶7, 8, and 3.
Regarding claim 5, with deficiencies of Gutta noted with square brackets [], Gutta teaches wherein the recognition result further comprises multiple characteristic values corresponding to the multiple regions, respectively (Gutta col. 5, ll. 46–65, acoustic and visual cues (characteristic values) are identified and associated with a region of interest from among the multiple regions of interest defined by the locations of the faces of the conference participants), and the processing circuit tracks a characteristic value of the specific region to determine a location of the specific region in a subsequent image data (Gutta col. 4, l. 59–col. 5, l. 13, detection and tracking operation tracks the object of interest using motion detection, gesturing, audio location (characteristic value of a region where a conference participant is seated) to determine which conference participant is the speaker and zoom in on a framed portion (specific region in a subsequent image data)), and processes the subsequent image data to highlight the specific region in the subsequent image data (Gutta col. 5, ll. 5–32, optical zooming operation performed on the identified object of interest frames the object of interest such that it is centered in the image and occupies a desired percentage of the image (highlight));
in response to another hand gesture detection result from the specific region, the processing circuit [disables the gesture lock] for the specific region (Gutta col. 7, ll. 12–40, the adaptive speaker identifier interacts with the predictive speaker identifier, including detecting hand gestures from each of the other meeting participants (from a different region), to predict the identity of the next speaker, and the speaker predictions generated are used to focus the PTZ camera on the next speaker, thus removing the focus from the specific region of the current speaker); and [the SoC] further comprises:
[a voice activity detection circuit, arranged to determine whether at least one part of the multiple sound signals comprises a voice component according to the multiple sound signals;]
wherein the processing circuit determines whether a speaker changes for determining whether to select another region from the multiple regions as the specific region according to the multiple characteristic values (Gutta col. 7, ll. 10–40, the focus of the PTZ camera on one of the conference participants, which selects another region (the region of the next speaker) as the specific region, is determined by the audio cues and video cues determined from processing audio signals generated by the array and video image from the PTZ camera) that correspond to the multiple regions, respectively, and are determined by the person recognition circuit, the subsequent hand gesture detection result, the voice characteristic value of the main sound determined by the sound detection circuit (Gutta col. 9, ll. 49–67, col. 10, ll. 62–67, and col. 7, l. 42–col. 8, l. 65, discloses that visual cue identifier and audio cue identifier provide cues to the predictive speaker identifier, each speaker corresponding to one of multiple speaker locations in a room (multiple regions), the predictive speaker identifier using the cues via a decision tree to predict the next speaker according to rules considering face detection, hand gesture and a grunt or clearing of the throat, where col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker via main sound from sound detection), and whether the at least one part of the multiple sound signals comprises the voice component determined by [the voice activity detection circuit] (Gutta col. 10, l. 62–col. 11, l. 20, audio cue identifier can identify speech from between different speakers, and who is currently speaking to determine who the next speaker may be).
As noted above in the rejection rationale for claims 3 and 1, from which claim 5 depends, Gutta does not, but Bytyqi teaches, the system on a chip, and a PHOSITA would be motivated to combine the teachings of Gutta with Bytyqi for the reasons provided regarding claims 3 and 1.
Further, Gutta does not explicitly teach, but Veselinovic teaches a voice activity detection circuit, arranged to determine whether at least one part of the multiple sound signals comprises a voice component according to the multiple sound signals (Veselinovic ¶¶70, 76, voice activity detector determines whether there is voice present in an audio signal, where the VAD is part of a conference room system recording multiple human speakers).
Still further, Gutta does not, but Kim teaches, a gesture lock (Kim ¶¶196–197, a gesture locking state is set in which any future gestures are not input into the electronic device for detection), and disables the gesture lock (Kim ¶196, gesture unlocking can be requested so that the device is in a gesture process standby state that will process gesture inputs).
Therefore, taking the teachings of Gutta as modified above by Bytyqi and Kim, and Veselinovic together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the voice activity detection teachings as disclosed and cited above in Veselinovic at least because doing so would provide for a higher quality sound capture and more optimal coverage of sound in a conference environment. See Veselinovic ¶¶7, 8, and 3.
Further, taking the teachings of Gutta as modified above by Bytyqi, and Kim together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the gesture locking and unlocking teachings as disclosed and cited above in Kim at least because doing so would allow for greater control over electronic devices by providing a meaningful service based on a result inferred from context information instead of simply monitoring one sensor input. See Kim ¶¶2, 57.
Regarding claim 6, with deficiencies of Gutta noted with square brackets [], Gutta teaches wherein the recognition result further comprises multiple characteristic values corresponding to the multiple regions, respectively (Gutta col. 5, ll. 46–65, acoustic and visual cues (characteristic values) are identified and associated with a region of interest from among the multiple regions of interest defined by the locations of the faces of the conference participants), and the processing circuit tracks a characteristic value of the specific region to determine a location of the specific region in a subsequent image data (Gutta col. 4, l. 59–col. 5, l. 13, detection and tracking operation tracks the object of interest using motion detection, gesturing, audio location (characteristic value of a region where a conference participant is seated) to determine which conference participant is the speaker and zoom in on a framed portion (specific region in a subsequent image data)), and processes the subsequent image data to highlight the specific region in the subsequent image data (Gutta col. 5, ll. 5–32, optical zooming operation performed on the identified object of interest frames the object of interest such that it is centered in the image and occupies a desired percentage of the image (highlight)); and
[the SoC] further comprises:
[a voice activity detection circuit, arranged to determine whether at least one part of the multiple sound signals comprises a voice component according to the multiple sound signals];
wherein the processing circuit determines whether a speaker changes for determining whether to select another region from the multiple regions as the specific region according to the multiple characteristic values (Gutta col. 7, ll. 10–40, the focus of the PTZ camera on one of the conference participants, which selects another region (the region of the next speaker) as the specific region, is determined by the audio cues and video cues determined from processing audio signals generated by the array and video image from the PTZ camera) that correspond to the multiple regions, respectively, and are determined by the person recognition circuit, the subsequent hand gesture detection result, the voice characteristic value of the main sound determined by the sound detection circuit (Gutta col. 9, ll. 49–67, col. 10, ll. 62–67, and col. 7, l. 42–col. 8, l. 65, disclosing that the visual cue identifier and audio cue identifier provide cues to the predictive speaker identifier, each speaker corresponding to one of multiple speaker locations in a room (multiple regions, including another region different from the specific region), the predictive speaker identifier using the cues via a decision tree to predict the next speaker according to rules considering face detection, hand gesture and a grunt or clearing of the throat, where col. 7, ll. 25–41 teaches directing the PTZ camera to focus on the predicted next speaker per the determined location of the speaker via main sound from sound detection), and
whether the at least one part of the multiple sound signals comprises the voice component determined by [the voice activity detection circuit] (Gutta col. 10, l. 62–col. 11, l. 20, the audio cue identifier can identify speech as between different speakers, and who is currently speaking, to determine who the next speaker may be), wherein no matter whether [the gesture lock for the specific region has ever been disabled], in response to the subsequent hand gesture result, the processing circuit selects said another region from the multiple regions as the specific region (Gutta col. 7, l. 42–col. 8, l. 53, decision tree constructed to determine what combination of visual and audio cues are to signal a speaker change and hence a zoom focus on another conference participant (selects another region), thus not requiring an unlock or disabling, only that a decision tree path is established for the speaker change detection, where the decision tree path can include a specific gesture such as a raised finger in the example).
As noted above in the rejection rationale for claims 3 and 1, from which claim 6 depends, Gutta does not, but Bytyqi teaches, the system on a chip, and a PHOSITA would be motivated to combine the teachings of Gutta with Bytyqi for the reasons provided regarding claims 3 and 1.
Further, Gutta does not explicitly teach, but Veselinovic teaches a voice activity detection circuit, arranged to determine whether at least one part of the multiple sound signals comprises a voice component according to the multiple sound signals (Veselinovic ¶¶70, 76, voice activity detector determines whether there is voice present in an audio signal, where the VAD is part of a conference room system recording multiple human speakers).
Still further, Gutta does not, but Kim teaches, a gesture lock and whether the gesture lock for the specific region has ever been disabled (Kim ¶¶196–197, a gesture locking state is set in which any future gestures are not input into the electronic device for detection, where gesture unlocking can be requested so that the device is in a gesture process standby state that will process gesture inputs).
Therefore, taking the teachings of Gutta as modified above by Bytyqi and Kim, and Veselinovic together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the voice activity detection teachings as disclosed and cited above in Veselinovic at least because doing so would provide for a higher quality sound capture and more optimal coverage of sound in a conference environment. See Veselinovic ¶¶7, 8, and 3.
Further, taking the teachings of Gutta as modified above by Bytyqi, and Kim together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the video conferencing apparatus of Gutta to include the gesture locking and unlocking teachings as disclosed and cited above in Kim at least because doing so would allow for greater control over electronic devices by providing a meaningful service based on a result inferred from context information instead of simply monitoring one sensor input. See Kim ¶¶2, 57.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Gutta in view of Bytyqi and Kim, as set forth above regarding claim 1, from which claim 9 depends, and further in view of Tan Wenjun et al., "Dynamic hand gesture recognition using motion trajectories and key frames," 2010 2nd International Conference on Advanced Computer Control, Shenyang, 2010, pp. 163–167, doi: 10.1109/ICACC.2010.5486760 (herein “Tan”).
Regarding claim 9, with deficiencies of Gutta noted in square brackets, Gutta teaches wherein the step of performing the hand gesture detection upon the hand gesture image data in the image data to generate the hand gesture detection result comprises: [performing a human hand recognition upon the image data to generate a human hand recognition result, and obtaining the hand gesture image data from the image data according to the human hand recognition result]; and performing the hand gesture detection upon the hand gesture image data to generate the hand gesture detection result (Gutta col. 7, ll. 12–16 and 45–51, video image from PTZ camera provided to predictive speaker identifier which processes the video to identify predefined visual cues including hand gestures, where col. 10, ll. 45–52 teaches that the hand gestures can be obtained using techniques described in a cited academic paper).
Gutta does not explicitly teach, but Tan teaches, performing a human hand recognition upon the image data to generate a human hand recognition result, and obtaining the hand gesture image data from the image data according to the human hand recognition result (Tan page 163, right column, Introduction, a hand gesture recognition system composed of one part that extracts areas of an image containing hands (human hand recognition result), detailed in Section II, Hand Segmentation, on pages 163–164, and a further part that then performs gesture recognition from the hand segmentation results, detailed in Section III, Hand Gesture Recognition).
Therefore, taking the teachings of Gutta and Tan as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the hand gesture detection of Gutta to include a pre-step of first detecting a human hand as disclosed in Tan, at least because doing so would provide for greater accuracy in gesture detection. See Tan page 166, Section IV, including the results table showing an improved recognition rate when considering the additional data of key frame detection where a hand is detected.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Thursday, 09:00-17:00, Friday 09:00-13:00, EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHELLE M KOETH/Primary Examiner, Art Unit 2671