DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Claims 1-10 are pending in this application.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-10 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 2A, Prong One: The independent claim 1 recites “obtaining a video signal, an audio signal and sound source localization information of a video conference; performing face recognition on a plurality of image frames of the video signal to obtain a plurality of face recognition results; performing voice recognition on a plurality of audio segments of the audio signal to obtain a plurality of voice recognition results at a plurality of timestamps; matching the voice recognition results with the face recognition results according to the sound source localization information, to obtain a plurality of speaker’s identities; performing speech to text transcription on the audio segments of the audio signal to obtain a transcript; attaching the speaker’s identities to the transcript according to the timestamps, to obtain a context; and performing context understanding on the context to obtain a meeting minutes report”.
Claims 1 and 10 recite obtaining audio, video, and the direction of the audio; recognizing and matching faces and voices based on the direction of each sound; transcribing the audio to text; and analyzing the context to generate a summary.
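For purposes of illustration only, the recited steps can be expressed as the following high-level sketch in Python. Every function name, data structure, and threshold value below is hypothetical; the sketch merely illustrates that each claimed step is recited as a result to be achieved by generic components rather than as a particular technical means of achieving it.

```python
# Illustrative-only sketch of the claimed pipeline; all names are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    timestamp: float        # start time of the audio segment (seconds)
    direction_deg: float    # sound source localization (angle of arrival)
    text: str               # speech-to-text transcription of the segment
    speaker: str = "unknown"


def recognize_faces(image_frames):
    # Placeholder: any off-the-shelf face recognizer could be substituted here.
    return [{"identity": "Alice", "direction_deg": 30.0},
            {"identity": "Bob", "direction_deg": 110.0}]


def match_speaker(utterance, face_results, tolerance_deg=15.0):
    # Match a voice segment to a face by comparing the sound source direction
    # of the segment with the direction of each recognized face.
    for face in face_results:
        if abs(face["direction_deg"] - utterance.direction_deg) <= tolerance_deg:
            return face["identity"]
    return "unknown"


def build_context(utterances: List[Utterance], face_results) -> str:
    # Attach speaker identities to the transcript to obtain a context.
    lines = []
    for u in utterances:
        u.speaker = match_speaker(u, face_results)
        lines.append(f"[{u.timestamp:06.1f}s] {u.speaker}: {u.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    faces = recognize_faces(image_frames=[])
    segments = [Utterance(timestamp=1.0, direction_deg=28.0, text="Let's begin."),
                Utterance(timestamp=6.5, direction_deg=112.0, text="Agreed.")]
    print(build_context(segments, faces))
```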
The following limitations are indicative of an abstract idea:
Recognizing and matching faces and voices are evaluation and decision-making steps that can practically be performed in the human mind; they are mental processes.
Transcribing speech into text is the conversion of verbal content to written form—a task humans routinely perform mentally or with conventional tools.
Understanding and summarizing the context are likewise cognitive processes that can practically be performed in the human mind; they are mental processes.
Accordingly, the claims are directed to the judicial exception of a mental process.
Step 2A, Prong Two: This judicial exception is not integrated into a practical application. The computer is recited at a high level of generality (i.e., as performing generic computer functions and as a tool used to apply the exception) such that it amounts to no more than mere instructions to apply the exception using a generic computer. Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B: The claims do not recite an inventive concept that transforms the mental process into patent-eligible subject matter.
The claims add generic, well-understood computer components (memory and processor) and broadly recite use of ‘matching recognition results according to the sound source localization information’ without describing any specific, unconventional structure, algorithmic detail, data structure, or system architecture that provides a concrete technical improvement in computer functionality.
Applying Alice step two and relevant Federal Circuit precedent:
The recitation of conventional computer components (memory and processor) performing routine functions does not supply an inventive concept.
The claims recite high-level, result-oriented steps (e.g., “performing,” “matching,” “attaching”) that describe mental processes rather than specific technical means for performing those processes.
Because the claims lack limitations that tie the mental-process steps to a particular way of achieving a technological improvement (for example, a novel model architecture, specialized data representation, unique training regimen that yields demonstrable technical performance gains, a specialized streaming/decoding pipeline that reduces latency by a quantifiable amount, or hardware/software co-design), the additional elements do not transform the mental processes into significantly more.
Therefore, claims 1 and 10 fail to recite an inventive concept sufficient to transform the judicial exception into patent-eligible subject matter.
With respect to dependent claims 2-9, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Conclusion of the 35 U.S.C. § 101 Rejection
Claims 1-10 are rejected under 35 U.S.C. § 101 as being directed to a judicial exception (mental processes) and failing to recite additional elements that amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Liu (US Pub. 2022/0335949) in view of Zhu et al. (US Pub. 2023/0205985).
Regarding claim 1, Liu discloses a method for generating meeting minutes, comprising:
obtaining a video signal, an audio signal and sound source localization information of a video conference (Figs. 1 and 2, [0131][0132] conference terminals in the conference site obtain video data, audio data and sound source direction information);
performing face recognition on a plurality of image frames of the video signal to obtain a plurality of face recognition results (Fig. 4, [0131][0150][0151] performing facial recognition on the received video data);
performing voice recognition on a plurality of audio segments of the audio signal to obtain a plurality of voice recognition results at a plurality of timestamps (Fig. 4 and [0132][0133][0148][0152][0236] performing voiceprint recognition on the received audio data with timestamps corresponding to each audio segment);
matching the voice recognition results with the face recognition results according to the sound source localization information, to obtain a plurality of speaker’s identities (Fig. 4, [0151][0152][0156] identifying a face and voiceprint in a sound source direction to obtain a face ID and voiceprint ID and determining a speaker identity corresponding to the audio segment);
performing speech to text transcription on the audio segments of the audio signal to obtain a transcript ([0155][0156] performing speech recognition to transcribe audio to text);
attaching the speaker’s identities to the transcript according to the timestamps, to obtain a context ([0102][0156][0184][0189] each text is marked with a speaker ID); and
performing [context understanding] on the context to obtain a meeting minutes report (Fig. 5, [0203] obtaining and displaying Minutes of meeting).
Liu does not explicitly teach the bracketed limitation; however, Zhu does explicitly teach:
performing [context understanding] on the context to obtain a meeting minutes report (Zhu, [0045][0086][0087][0476][0489][0490][0512] understanding speech by analyzing the post-processed transcription, and generating meeting minutes that are automatically populated with content from the post-processed transcription).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of processing conference data based on a sound source direction as taught by Liu with the method of adapting a post-processing system configured to perform editing of the meeting transcripts as taught by Zhu, in order to improve the accuracy and readability of the transcript for subsequent downstream operations such as generating summaries (Zhu, [0016]).
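For illustration only, the following is a minimal sketch of a generic "context understanding" post-processing step of the kind the combination is characterized above as teaching, applied to a speaker-attributed transcript. The summarization logic and all data values are stand-ins and are not drawn from Liu's or Zhu's actual disclosures.

```python
# Illustrative stand-in for a "context understanding" step; not Liu's or Zhu's
# actual implementation.
def summarize_context(context_lines):
    """Toy summarizer: keep the first statement attributed to each speaker."""
    seen = set()
    minutes = []
    for line in context_lines:
        speaker = line.split(":", 1)[0]
        if speaker not in seen:
            seen.add(speaker)
            minutes.append(line)
    return "Meeting minutes (draft):\n" + "\n".join(minutes)


print(summarize_context([
    "Alice: Let's begin with the budget review.",
    "Bob: Agreed, the Q3 numbers are ready.",
    "Alice: Then we can move to staffing.",
]))
```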
Regarding claim 2, Liu in view of Zhu discloses the method of claim 1, and Zhu further discloses:
wherein the voice recognition results and the face recognition results comprise a plurality of known identities and at least one unknown identity (Zhu, Fig. 8, 820, 822, and 824, [0503][0504][0520] determining whether the content 804 is associated with audio and visual data collected from a meeting in which the contributing entities 802 are meeting participants based on the speech and image recognition results).
The previous motivation statement as in claim 1 is still applied.
Regarding claim 3, Liu in view of Zhu discloses the method of claim 1, and Zhu further discloses:
wherein the voice recognition results comprise a plurality of unknown identities, and wherein the face recognition results comprise a plurality of known identities (Zhu, [0445]-[0447] utilizing only audio data and/or only visual data to attribute sub-portions to a user profile for guest user profiles).
The previous motivation statement as in claim 1 is still applied.
Regarding claim 4, Liu in view of Zhu discloses the method of claim 3, and Zhu further discloses:
determining the speaker’s identities from the known identities, according to the sound source localization information and the face recognition results; and updating the unknown identities comprised in the voice recognition results with the speaker’s identities, according to the timestamps (Zhu, [0445][0446] “the guest user profile becomes a known user profile when the guest user profile is tagged with the identity of a meeting participant, thereby linking the user profile and its associated electronic content to the meeting participant”).
The previous motivation statement as in claim 1 is still applied.
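As an illustration of the identity-update limitation mapped above for claim 4, the following minimal sketch updates "unknown" voice recognition identities with speaker identities resolved from face recognition and sound source direction, keyed by timestamp. All data values and structures are hypothetical.

```python
# Hypothetical data: timestamp -> identity from voice recognition, and
# speaker identities resolved from face recognition + sound source direction.
voice_results = {1.0: "unknown", 6.5: "unknown"}
resolved_speakers = {1.0: "Alice", 6.5: "Bob"}

# Update each "unknown" voice recognition identity with the resolved
# speaker identity sharing the same timestamp.
for timestamp, identity in voice_results.items():
    if identity == "unknown" and timestamp in resolved_speakers:
        voice_results[timestamp] = resolved_speakers[timestamp]

print(voice_results)  # {1.0: 'Alice', 6.5: 'Bob'}
```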
Regarding claim 5, Liu in view of Zhu discloses the method of claim 1, and Zhu further discloses:
wherein the voice recognition results comprise a plurality of known identities, and wherein the face recognition results comprise a plurality of unknown identities (Zhu, [0445]-[0447] utilizing only audio data and/or only visual data to attribute sub-portions to a user profile for guest user profiles).
The previous motivation statement as in claim 1 is still applied.
Regarding claim 6, Liu in view of Zhu discloses the method of claim 1, and Liu further discloses:
wherein the sound source localization information comprises at least one of an angle and a direction of each of sound sources (Liu, [0030][0034] capturing target images using a director camera and collecting sound source localization information from each direction).
Regarding claim 7, Liu in view of Zhu discloses the method of claim 1, and Zhu further discloses:
wherein the face recognition results comprise coordinates of a plurality of facial bounding boxes in the image frames and an identity corresponding to each of the facial bounding boxes (Zhu, Fig. 9, [0545][0546] detecting and identifying faces corresponding to each of the detected areas, 950A-950D).
The previous motivation statement as in claim 1 is still applied.
Regarding claim 8, Liu in view of Zhu discloses the method of claim 1, and Liu further discloses:
obtaining a text input associated with at least one user profile; inserting the text input into the transcript according to time series, to generate an updated transcript; attaching the speaker’s identities to the updated transcript according to the timestamps, to obtain the context (Liu, [0003][0184] obtaining a statement text made by each person during a conference).
Liu does not explicitly teach the following limitation; however, Zhu does explicitly teach:
performing the context understanding on the context to obtain the meeting minutes report (Zhu, [0045][0086][0087][0476][0489][0490][0512] understanding speech by analyzing the post-processed transcription, and generating the meeting minutes that are automatically populated with content from the post-processed transcription).
The previous motivation statement as in claim 1 is still applied.
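For illustration of the claim 8 limitation of inserting a text input into the transcript according to time series, the following minimal sketch merges a typed note associated with a user profile into a spoken transcript in time order. The data and field layout are hypothetical and do not represent Liu's or Zhu's actual disclosures.

```python
# Hypothetical data: (timestamp, speaker, text) entries.
transcript = [(1.0, "Alice", "Let's begin."), (6.5, "Bob", "Agreed.")]
text_inputs = [(3.2, "Carol", "Slides are in the shared folder.")]  # typed note

# Merge the typed note into the spoken transcript in time order to obtain
# an updated transcript.
updated_transcript = sorted(transcript + text_inputs, key=lambda entry: entry[0])

for timestamp, speaker, text in updated_transcript:
    print(f"[{timestamp:05.1f}s] {speaker}: {text}")
```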
Regarding claim 10, claim 10 is the corresponding system claim to method claim 1. Therefore, claim 10 is rejected using the same rationale as applied to claim 1 above.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Liu (US Pub. 2022/0335949) in view of Zhu et al. (US Pub. 2023/0205985) and further in view of Wasserblat et al. (US Pub. 2012/0215535).
Regarding claim 9, Liu in view of Zhu discloses the method of claim 1, and Liu further discloses:
performing summary extraction on the updated context, to obtain the meeting minutes report (Liu, Fig. 5, [0203] obtaining and displaying Minutes of meeting).
Liu in view of Zhu does not explicitly teach the following limitations; however, Wasserblat does explicitly teach:
performing the context understanding on the context, to obtain a plurality of emotional semantics of a plurality of sentences comprised in the context; removing a portion of the context according to the emotional semantics, to generate an updated context (Wasserblat, Figs. 1, 4 and 5, interaction analytics 136, [0062][0070][0085]-[0087] filtering emotional words by using semantic inference).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of processing conference data based on a sound source direction as taught by Liu in view of Zhu with the method of filtering out emotional words as taught by Wasserblat to provide suitable interactions for further analysis (Wasserblat, [0008]).
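As an illustration of the claim 9 limitations as mapped above, the following minimal sketch removes sentences with emotional content from the context before summary extraction. The word list and scoring are simple stand-ins and are not Wasserblat's semantic inference technique; all data values are hypothetical.

```python
# Stand-in emotional lexicon and scoring; not Wasserblat's semantic inference.
EMOTIONAL_WORDS = {"angry", "terrible", "hate", "awful"}

def emotional_score(sentence: str) -> int:
    # Count emotional words in the sentence (rough stand-in metric).
    return sum(word.strip(".,!?").lower() in EMOTIONAL_WORDS
               for word in sentence.split())

context = [
    "Alice: The release slipped by one week.",
    "Bob: I hate this awful schedule!",
    "Alice: We will re-plan the milestones on Monday.",
]

# Remove sentences whose emotional score is nonzero; the remaining updated
# context would then feed summary extraction.
updated_context = [s for s in context if emotional_score(s) == 0]
print("\n".join(updated_context))
```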
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see attached form PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEONG-AH A. SHIN whose telephone number is (571)272-5933. The examiner can normally be reached 9 AM-3PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SEONG-AH A SHIN/Primary Examiner, Art Unit 2659