Last updated: May 04, 2026

Application No. 18/624,381

METHOD AND SYSTEM FOR REAL-TIME ACTIVE SPEAKER DETECTION

Non-Final OA §102 §103

Filed: Apr 02, 2024

Examiner: MARIAM, DANIEL G

Art Unit: 2675

Tech Center: 2600 (Communications)

Assignee: LENOVO (SINGAPORE) PTE. LTD.

OA Round: 1 (Non-Final)

Interview Optional

Interview lift (+10.3%) is below the 15.0% threshold. A written response is recommended.

Based on 1185 resolved cases, 2023–2026

Examiner Intelligence

MARIAM, DANIEL G

Grants 91% — above average

Career Allowance Rate

1074 granted / 1185 resolved

+28.6% vs TC avg

Interview Lift: +10.3% (moderate)

Avg Prosecution (typical timeline): 2y 3m

10 currently pending

Total Applications (career history): 1195 across all art units

Statute-Specific Performance

§101: 15.8% (-24.2% vs TC avg)

§103: 33.3% (-6.7% vs TC avg)

§102: 20.7% (-19.3% vs TC avg)

§112: 21.0% (-19.0% vs TC avg)

Tech Center average is an estimate. Based on career data from 1185 resolved cases.

Office Action

§102 §103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Notice re prior art available under both pre-AIA and AIA
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Examiner's Note 
Examiner has cited particular columns and line numbers or figures in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested from the applicant, in preparing the responses, to fully consider the references in entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-5, 8-12, and 15-18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Chaudhuri, et al. (US 10,846,522 B2). Please note that, due to the very broad formulation of claim 1, its subject matter is disclosed by a plurality of documents. For procedural efficiency, the examiner has focused the search on prior art that discloses further to independent claim 1.
With regard to claim 1, Chaudhuri, et al. disclose an active speaker detection system, comprising: a visual sensor that captures a visual scene including a first person, i.e., target person (See for example, col. 8, lines 57-65; and col. 9, lines 43-44); and a computer system comprising: one or more computer processors (See for example, Figs. 1-2 and the associated text); and a detection model comprising an audiovisual encoder (See for example, col. 4, lines 64-67; and items 116 and 124 in Fig. 1) and a classifier (See for example, col. 5, lines 1-11; and item 114 in Fig. 1), wherein the computer system is communicably coupled to the visual sensor (See for example, Figs. 1-2 and the associated text) and configured to: obtain a first set of frames and a second set of frames from the visual sensor (See for example, col. 8, line 66 – col. 9, line 21), produce a first embedding and a second embedding from the first set of frames and the second set of frames, respectively, using the audiovisual encoder (See for example, col. 4, lines 51-67), generate one or more composite embeddings from the first embedding and the second embedding (See for example, col. 7, lines 14-21, and col. 7, lines 32-35), determine, using the classifier, an active speaker detection (ASD) score for each of the one or more composite embeddings (See for example, col. 7, lines 32-52; col. 8, line 66 – col. 9, line 33; and col. 11, lines 47-61), aggregate the one or more ASD scores forming a detection result (See for example, col. 9, lines 22-33), determine whether the first person is speaking based on the detection result, and upon determining that the first person is speaking, adjust a display of the visual scene to focus (via zooming in and/or annotating the video with a bounding box around the face of the current speaker) on the first person (See for example, col. 9, lines 41-65). Thus, each of the requirements of claim 1 is met.
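For orientation, the processing steps recited in claim 1 (encode two sets of frames, form composite embeddings, score each with a classifier, aggregate the scores) can be sketched as follows. This is only an illustrative sketch and is not taken from the application or from Chaudhuri: the frame representation, the stand-in `encode` and `classify` functions, the sliding-window way of forming composites, the mean aggregation, and the 0.5 cut-off are all assumptions made for the example.

```python
from math import exp
from statistics import mean
from typing import List, Sequence

Frame = Sequence[float]        # assumption: a frame is a flat list of numbers
Embedding = List[List[float]]  # one feature vector per frame (cf. claim 4)


def encode(frames: Sequence[Frame]) -> Embedding:
    """Hypothetical stand-in for the audiovisual encoder."""
    return [[mean(f), max(f) - min(f)] for f in frames]


def composites(first: Embedding, second: Embedding, window: int) -> List[Embedding]:
    """Assumed scheme: slide a fixed window over the two embeddings joined in time order."""
    joined = first + second
    return [joined[i:i + window] for i in range(len(joined) - window + 1)]


def classify(composite: Embedding) -> float:
    """Hypothetical stand-in for the classifier: a logistic score in (0, 1)."""
    activation = mean(vector[0] for vector in composite)
    return 1.0 / (1.0 + exp(-activation))


def detect(first_frames: Sequence[Frame], second_frames: Sequence[Frame],
           window: int = 3, cutoff: float = 0.5) -> bool:
    """Return True if the person is judged to be speaking."""
    first, second = encode(first_frames), encode(second_frames)
    scores = [classify(c) for c in composites(first, second, window)]
    if not scores:               # window longer than the available frames
        return False
    detection_result = mean(scores)   # assumed aggregation: plain average
    return detection_result > cutoff  # assumed decision rule


if __name__ == "__main__":
    print(detect([[1, 2], [2, 3]], [[3, 4], [4, 5]]))
```

The sketch stops at the speaking decision; the final claim step of adjusting the display to focus on the speaker is omitted.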
With regard to claim 2, the active speaker detection system according to claim 1, wherein the determination of whether the first person is speaking corresponds to the second set of frames (See for example, col. 8, line 66 – col. 9, line 21; and Fig. 2).
With regard to claim 3, the active speaker detection system according to claim 1, wherein the second set of frames are temporally after the first set of frames (See for example, col. 8, lines 57-65; and Fig. 2). 
With regard to claim 4, the active speaker detection system according to claim 1, wherein the first embedding and the second embedding each comprise a number of audiovisual feature vectors, and the number of audiovisual feature vectors is equal to a number of frames in the first set or the second set (See for example, col. 5, line 62 – col. 6, line 2; and col. 7, lines 32-52).
With regard to claim 5, the active speaker detection system according to claim 1, wherein the audiovisual encoder comprises a neural network (See for example, col. 4, lines 51-53); and the classifier comprises a recurrent neural network (See for example, col. 7, line 58 – col. 8, line 12). 
Claim 8 is rejected the same as claim 1 except claim 8 is a method claim. Thus, argument similar to that presented above for claim 1 is applicable to claim 8.
Claims 9, 10, 11, and 12 are rejected the same as claims 2, 3, 4, and 5 respectively, except claims 9, 10, 11, and 12 are method claims. Thus, arguments similar to those presented above for claims 2, 3, 4, and 5 are respectively applicable to claims 9, 10, 11, and 12.
Claim 15 is rejected the same as claim 8. Thus, argument similar to that presented above for claim 8 is applicable to claim 15. Claim 15 distinguishes from claim 8 only in that it recites a non-transitory computer-readable medium comprising computer-executable instructions. Chaudhuri (See for example, col. 4, lines 9-14; and col. 13, lines 22-42) teaches this feature.
Claims 16, 17, and 18 are rejected the same as claims 9, 10 and 11 respectively. Thus, arguments similar to those presented above for claims 9, 10, and 11 are respectively applicable to claims 16, 17, and 18.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6-7, 13-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri, et al. (US 10,846,522 B2) in view of DONTCHEVA, et al. (US 2024/0134597 A1). 
With regard to claim 6, Chaudhuri, et al. (hereinafter “Chaudhuri”) discloses all of the claimed subject matter as already discussed above in paragraph 6, and incorporated herein by reference. Chaudhuri further discloses wherein the detection result comprises a first speaking metric for the first person, the determination of whether the first person is speaking comprises comparing the first speaking metric, i.e., a probability that a person is speaking, to a they are from the same field of endeavor, i.e., active speaker detection (See for example, paragraph 0050).  Before the effective filing date of the claimed invention, it would have been obvious to incorporate the teaching as taught by DONTCHEVA, et al. into the system of Chaudhuri, et al.  and to do so would at least allow detection of an active speaker based on a predefined threshold (See for example, paragraph 0096). Therefore, it would have been obvious to combine Chaudhuri with DONTCHEVA, et al. to obtain the invention as specified in claim 6.
With regard to claim 7, the active speaker detection system according to claim 6, wherein the visual scene further includes a second person, i.e., multiple people, and the detection result further comprises a second speaking metric for the second person, and the determination of whether the first person is speaking further comprises: obtaining a status for the first person in response to the speaking metric of the first person being lower than or equal to the threshold; and determining whether the first speaking metric is greater than the second speaking metric and whether the status of the first person is active, wherein the first person is determined to be speaking in response to the status of the first person being active and the first speaking metric being greater than the second speaking metric (See for example, col. 9, lines 41-65 of Chaudhuri; and paragraph 0096 of DONTCHEVA, et al.).
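The decision logic recited in claims 6 and 7 can be read as a two-stage test, sketched below. This is a hedged illustration only: the function name, the boolean representation of the "status", and the rule that a metric above the threshold alone means "speaking" are assumptions, since the claim language quoted in this action does not spell them out.

```python
def is_first_person_speaking(first_metric: float,
                             second_metric: float,
                             threshold: float,
                             first_status_active: bool) -> bool:
    """Two-stage speaking decision, loosely following claims 6 and 7."""
    # Stage 1 (claim 6, assumed direction): a metric above the threshold
    # is treated as sufficient on its own.
    if first_metric > threshold:
        return True
    # Stage 2 (claim 7): the metric is lower than or equal to the threshold,
    # so the first person's status is consulted. The person counts as speaking
    # only if the status is active AND the first metric exceeds the second.
    return first_status_active and first_metric > second_metric


if __name__ == "__main__":
    # Below the threshold, status active, and ahead of the second person.
    print(is_first_person_speaking(0.4, 0.3, 0.5, True))
```

Under this reading, a person whose metric has dipped to or below the threshold is still treated as the speaker when the status is active and no one else scores higher.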
Claims 13 and 14 are rejected the same as claims 6 and 7 respectively, except claims 13 and 14 are method claims. Thus, arguments similar to those presented above for claims 6 and 7 are respectively applicable to claims 13 and 14.
Claims 19 and 20 are rejected the same as claims 13 and 14 respectively. Thus, arguments similar to those presented above for claims 13 and 14 are respectively applicable to claims 19 and 20.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Roth, et al. (AVA Active Speaker: An Audio-Visual Data Set for Active Speaker Detection) (See for example, Figs. 1 and 4, and the associated text); Tesema, et al. (Efficient Audiovisual Fusion for Active Speaker Detection) (See entire document); and US Patent Application Publication Number 2005/0243167 (See for example, paragraphs 0012, 0015, 0029-0030, 0050, and 0061-0062).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL G MARIAM whose telephone number is (571)272-7394. The examiner can normally be reached M-F 7:30-5:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ANDREW MOYER can be reached at (571)272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DANIEL G MARIAM/Primary Examiner, Art Unit 2675

Prosecution Timeline

Apr 02, 2024

Application Filed

Mar 04, 2026

Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/947,737 (Patent 12597281): IMAGE AND SEMANTIC BASED TABLE RECOGNITION. Granted Apr 07, 2026; 3y 6m to grant.

18/299,549 (Patent 12584859): IDENTIFYING AUTO-FLUORESCENT ARTIFACTS IN A MULTIPLEXED IMMUNOFLUORESCENT IMAGE. Granted Mar 24, 2026; 2y 11m to grant.

18/128,708 (Patent 12579782): METHOD FOR IMAGE PROCESSING. Granted Mar 17, 2026; 2y 11m to grant.

18/346,953 (Patent 12579833): IDENTITY DOCUMENT DETECTION WITH CONVOLUTIONAL NEURAL NETWORKS FOR DATA LOSS PREVENTION. Granted Mar 17, 2026; 2y 8m to grant.

18/263,261 (Patent 12573200): VIDEO-BASED BEHAVIOR RECOGNITION DEVICE AND OPERATION METHOD THEREFOR. Granted Mar 10, 2026; 2y 7m to grant.

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2

Grant Probability: 91%

With Interview (+10.3%): 99%

Median Time to Grant: 2y 3m (~2m remaining)

PTA Risk: Low

Based on 1185 resolved cases by this examiner. Grant probability derived from career allowance rate.
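
The figures above can be cross-checked with simple arithmetic. The sketch below is an assumption about how the page's numbers relate, not the tool's actual method: the function names are hypothetical, and the label returned for a lift at or above the threshold is a guess, since the page only states the outcome for a lift below it.

```python
def allowance_rate_pct(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved


def response_recommendation(lift_pct: float, threshold_pct: float = 15.0) -> str:
    """Written response when the interview lift is below the threshold."""
    if lift_pct < threshold_pct:
        return "written response"
    return "interview"  # assumed label; not stated on the page


if __name__ == "__main__":
    rate = allowance_rate_pct(1074, 1185)          # figures from this page
    print(f"allowance rate: {rate:.1f}%")
    print(f"pending: {1195 - 1185}")               # total minus resolved
    print(response_recommendation(10.3))
```

The rounded allowance rate matches the 91% grant probability shown, and total applications minus resolved cases matches the 10 pending. The 99% "With Interview" figure is lower than 91% plus 10.3 points, which suggests a cap, but the page does not say how it is computed.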