Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-12, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Qin (US 2020/0349953) in view of Diamant (US 2019/0341050).
Qin discloses a computer-implemented method for automatic conversation transcription (Section 0001: conversation and speaker transcript generation), comprising:
receiving audio streams from at least one audio source (Section 0031, lines 3-6, "receive audio streams"; Fig. 5, element 510);
generating speech segments based on voice activity detection (Section 0086, lines 12-14, "speech recognition and generate transcript"; Section 0090, lines 6-8, "voice activity detection");
generating a plurality of text strings by transcribing the speech segments with a speech recognition system (Section 0029, lines 3-6, "generating transcript from recognizing speech from users speaking from the meeting"; Section 0051 also describes generating a transcript of the ad-hoc meeting);
determining a plurality of speaker identities associated with the plurality of text strings based on a speaker diarization model (Section 0091, lines 1-3: the speaker diarization module receives input, and a third operation assigns a speaker ID, which results in an assignment of a speaker label);
assigning respective indicators to the plurality of text strings based on the plurality of speaker identities, wherein text strings associated with one speaker are assigned the same indicator (Section 0091, lines 14-16: the speaker labels and speaker embeddings associated with each recognized word of the top SR hypothesis); and
generating a transcript by combining the plurality of text strings associated with the respective indicators (Section 0136, lines 4-8: the transcript is generated prior to translating the transcript),
wherein the speaker diarization model is configured to utilize one or more diarization factors comprising audio channel data to determine the plurality of speaker identities (Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).
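For purposes of illustrating the indicator-assignment and combining steps recited above, a minimal Python sketch follows. It is built only from the claim language (text strings, speaker identities, indicators) and does not reflect Qin's actual implementation; all names (TextString, build_transcript) are hypothetical.

from dataclasses import dataclass

@dataclass
class TextString:
    start: float      # position on a common clock, in seconds
    text: str
    speaker_id: str   # identity produced by the speaker diarization model

def build_transcript(strings: list[TextString]) -> str:
    # Assign one indicator per speaker identity so that all text strings
    # associated with one speaker receive the same indicator.
    indicators: dict[str, str] = {}
    ordered = sorted(strings, key=lambda s: s.start)
    for s in ordered:
        indicators.setdefault(s.speaker_id, f"Speaker {len(indicators) + 1}")
    # Generate the transcript by combining the labeled text strings.
    return "\n".join(f"[{indicators[s.speaker_id]}] {s.text}" for s in ordered)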
Qin fails to clearly disclose generating voice segments by segmenting the audio streams.
Diamant discloses transcription of a conference using a computerized intelligent assistant that generates voice segments by segmenting the audio streams (Section 0053: this section describes a speech recognition machine that translates the signal into text, "Shall we play a game?"; see the screenshot reproduced below).
[media_image1.png, greyscale: screenshot reproduced from Diamant, Section 0053]
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Qin to include Diamant's teaching of generating segments based on speech recognition or voice activity detection. The motivation is that it makes analyzing voice activity more effective.
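For context, voice-activity-detection segmentation of the kind discussed above can be sketched with a simple frame-energy threshold. This is an illustrative stand-in only; the detectors actually used by Qin and Diamant are not described at this level, and the function name, frame size, and threshold below are assumptions.

import numpy as np

def vad_segments(audio: np.ndarray, sr: int, frame_ms: int = 30,
                 energy_thresh: float = 1e-3) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) voice segments found by frame energy."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame
    active = [float(np.mean(audio[i * frame:(i + 1) * frame] ** 2)) > energy_thresh
              for i in range(n_frames)]
    segments, start = [], None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i                      # a voice segment begins
        elif not is_speech and start is not None:
            segments.append((start * frame / sr, i * frame / sr))
            start = None                   # the segment ends at silence
    if start is not None:
        segments.append((start * frame / sr, n_frames * frame / sr))
    return segments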
Claim 2, Qin in view of Diamant discloses, when the one or more diarization factors comprise speech feature vectors data (Qin: embedding feature in Section 0091, lines 8-9), the computer-implemented method further comprising:
determining that the distance between the speech feature vectors of a group of speech segments is below a threshold (Qin: Section 0091, lines 11-14, "cosine similarity, negative Euclidean distance"); and
clustering the group of speech segments by assigning the same indicator to the group of speech segments (Qin: Section 0091, lines 8-10, agglomerative clustering).
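The cited operations, distance comparison between speech feature vectors followed by agglomerative clustering, can be illustrated with SciPy's hierarchical-clustering utilities. The metric, linkage method, and threshold below are illustrative assumptions, not Qin's actual parameters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_speech_segments(embeddings: np.ndarray,
                            threshold: float = 0.4) -> np.ndarray:
    """Assign the same indicator to speech segments whose feature vectors
    are within `threshold` cosine distance of one another."""
    # Average-linkage agglomerative clustering over pairwise cosine distances.
    Z = linkage(embeddings, method="average", metric="cosine")
    # Cut the dendrogram so intra-cluster distances stay below the threshold;
    # the returned labels act as per-segment speaker indicators.
    return fcluster(Z, t=threshold, criterion="distance")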
Claim 3, Qin in view of Diamant discloses wherein the one or more diarization factors comprise acoustic beamforming data (Qin: Section 0095, lines 1-3, agnostic beamforming).
Claim 5, Qin in view of Diamant discloses further comprising timestamping the plurality of text strings according to a common clock;
storing the timestamps associated with the text strings (Qin: Section 0100: each utterance is assigned a universal timestamp, an associated speaker, associated text, and an associated audio segment);
receiving a request from an editing application to play audio corresponding to a text string; and playing audio beginning at the timestamp corresponding to the requested text string (Qin: Section 0100, lines 3-9: the media in the chat are associated to the transcript inline through a timestamp).
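The timestamp-and-playback flow mapped above can be pictured with a short sketch. The Utterance record mirrors Qin's cited per-utterance fields (universal timestamp, speaker, text), while the player object and its seek/play methods are hypothetical.

from dataclasses import dataclass

@dataclass
class Utterance:
    timestamp: float   # universal timestamp on the common clock, in seconds
    speaker: str
    text: str

def play_for_text(utterances: list[Utterance], requested_text: str, player) -> None:
    """Serve an editing application's request: begin audio playback at the
    timestamp stored for the requested text string."""
    for u in utterances:
        if u.text == requested_text:
            player.seek(u.timestamp)   # hypothetical audio-player API
            player.play()
            return
    raise ValueError("no stored timestamp for the requested text string")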
Claim 6, Qin in view of Diamant discloses further comprising displaying, on a screen, the transcript; and automatically scrolling through the plurality of text strings associated with audio streams being played (Qin: Section 0135: the translated transcript is provided to a device, for example as text displayed on a display; also see Section 0043).
Claim 7, Qin in view of Diamant discloses further comprising receiving video streams associated with the at least one audio source (Qin: Section 0058, lines 11-13: a meeting-assistance type of device in a conference room, or a video camera having a field of view of a meeting).
Claim 8, Qin in view of Diamant discloses wherein the one or more diarization factors comprise speaker visual data (Qin: Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).
Claim 9, Qin in view of Diamant discloses further comprising displaying, on a screen, video streams accompanying the audio streams; capturing screenshots of video streams accompanying the audio streams (Qin: video or still images described in Section 0108 read on the captured screenshots); and
generating the transcript by combining the plurality of text strings associated with the respective indicators and the screenshots based on the timestamps (Qin: Section 0100, lines 1-7: the media are tied to the transcript inline through a timestamp to the whole meeting).
Claim 10, Qin in view of Diamant discloses further comprising displaying, on the screen, the screenshots of video streams in a grid, wherein the screenshots of video streams are configured to associate with corresponding text strings and to represent the content of the video streams (Qin: Section 0055, lines 5-7: performing image recognition on images from video signals; this means the captured video can be used as a video stream);
receiving a selection of a screenshot in the grid; displaying the corresponding text strings based on the selected screenshot; and playing audio associated with the corresponding text strings (Diamant: Figs. 1, 4, and 5 show screenshots of the users, which can be used for playback).
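The grid-selection behavior of claim 10 can likewise be sketched. The Screenshot record, the 10-second association window, and the player object (as in the previous sketch) are all illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Screenshot:
    timestamp: float   # capture time on the common clock, in seconds
    image_path: str

def on_screenshot_selected(shot: Screenshot, utterances, player) -> list[str]:
    """Handle selection of a screenshot in the grid: collect the text strings
    near its timestamp for display and start audio playback there."""
    window = 10.0  # seconds; arbitrary association window for illustration
    texts = [u.text for u in utterances
             if abs(u.timestamp - shot.timestamp) <= window]
    player.seek(shot.timestamp)   # hypothetical audio-player API
    player.play()
    return texts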
Claim 11, Qin in view of Diamant discloses further comprising:
capturing a plurality of screenshots of video streams with respective timestamps based on a common clock (Qin: Section 0100, lines 1-7: a picture of a whiteboard can be captured and uploaded at time t, where time t is the timestamp);
generating a plurality of animated video files based on the plurality of screenshots (Qin: Section 0118: a camera providing video of at least one of the users reads on animated video files);
displaying, on the screen, the plurality of animated videos in a grid, wherein the animated videos are configured to associate with corresponding text strings and to represent the content of the video streams (Diamant: Section 0061: transcriptions include information such as the times of each speech utterance; see the figure reproduced below);
[media_image2.png, greyscale: reproduction of Diamant, Fig. 10]
Fig. 10 of Diamant shows a display of video of the speakers with the transcript of their corresponding text strings, which represents the content of the video streams.
receiving a selection of an animated video file in the grid; playing the selected animated video file on the screen; playing audio associated with the selected animated video file (Qin: Section 0043, lines 3-4: streaming audio and/or video from a camera distributed to a meeting server reads on playing audio and video files); and
displaying, on the screen, the corresponding text strings based on the selected animated video file (Diamant: Section 0061, Fig. 10 shows a conference transcript that includes attributed text and the times of each speech utterance or the position of the speaker of each utterance).
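Generating "animated video files based on the plurality of screenshots" could be approximated, purely for illustration, by assembling the captured frames into an animated GIF with Pillow; nothing in Qin ties the claimed files to this particular format, and the function name and frame duration are assumptions.

from PIL import Image

def make_animated_file(screenshot_paths: list[str], out_path: str,
                       frame_ms: int = 500) -> None:
    """Combine captured screenshots into one animated file (out_path
    should end in .gif so Pillow selects the GIF writer)."""
    frames = [Image.open(p) for p in screenshot_paths]
    frames[0].save(out_path, save_all=True, append_images=frames[1:],
                   duration=frame_ms, loop=0)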
Claim 12, Qin discloses a computer-implemented method for automatic conversation transcription (Section 0001: conversation and speaker transcript generation), comprising:
receiving audio streams from at least one audio source (Section 0031, lines 3-6, "receive audio streams"; Fig. 5, element 510);
generating a plurality of text strings by transcribing the audio streams with a speech recognition system (Section 0086, lines 12-14, "speech recognition and generate transcript"; Section 0090, lines 6-8, "voice activity detection");
determining a plurality of speaker identities associated with the plurality of text strings based on a speaker diarization model (Section 0029, lines 3-6, "generating transcript from recognizing speech from users speaking from the meeting"; Section 0051 also describes generating a transcript of the ad-hoc meeting);
assigning respective indicators to the plurality of text strings based on the plurality of speaker identities, wherein text strings associated with one speaker are assigned the same indicator (Section 0091, lines 14-16: the speaker labels and speaker embeddings associated with each recognized word of the top SR hypothesis);
generating a transcript by combining the plurality of text strings associated with the respective indicators (Section 0136, lines 4-8: the transcript is generated prior to translating the transcript),
wherein the speaker diarization model is configured to utilize one or more diarization factors comprising audio channel data to determine the plurality of speaker identities (Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).
Qin fails to clearly disclose generating voice segments by segmenting the audio streams.
Diamant discloses transcription of a conference using a computerized intelligent assistant that generates voice segments by segmenting the audio streams (Section 0053: this section describes a speech recognition machine that translates the signal into text, "Shall we play a game?"; see the screenshot reproduced below).
[media_image1.png, greyscale: screenshot reproduced from Diamant, Section 0053]
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Qin to include Diamant's teaching of generating segments based on speech recognition or voice activity detection. The motivation is that it makes analyzing voice activity more effective.
Claim 14, Qin in view of Diamant discloses further comprising:
generating speech segments by segmenting the audio streams, wherein the segmenting is based on voice activity detection (Diamant: Section 0053: this section describes a speech recognition machine that translates the signal into text, "Shall we play a game?"; see the screenshot reproduced above);
timestamping the plurality of text strings according to a common clock; storing the timestamps associated with the text strings; receiving a request from an editing application to play audio corresponding to a text string (Qin: Section 0100: each utterance is assigned a universal timestamp, an associated speaker, associated text, and an associated audio segment);
and playing audio beginning at the timestamp corresponding to the requested text string (Qin: Section 0100, lines 3-9: the media in the chat are associated to the transcript inline through a timestamp).
Claim 15, Qin in view of Diamant discloses further comprising displaying the transcript on a screen; and automatically scrolling through the plurality of text strings associated with audio streams being played (Qin: Section 0135: the translated transcript is provided to a device, for example as text displayed on a display; also see Section 0043).
Claim 16, Qin in view of Diamant discloses further comprising receiving video streams associated with the at least one audio source (Qin: video or still images described in Section 0108 read on the captured screenshots).
Claim 17, Qin in view of Diamant discloses wherein the one or more diarization factors comprise speaker visual data (Qin: Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).
Claim 18, Qin in view of Diamant discloses further comprising displaying, on a screen, video streams accompanying the audio streams (Diamant: Fig. 10 shows the audio streams);
capturing screenshots of video streams accompanying the audio streams (Qin: video or still images described in Section 0108 read on the captured screenshots); and
generating the transcript by combining the plurality of text strings associated with the respective indicators and the screenshots based on the timestamps (Qin: Section 0100, lines 1-7: the media are tied to the transcript inline through a timestamp to the whole meeting).
Claim 19, Qin in view of Diamant discloses further comprising:
displaying, on the screen, the screenshots of video streams in a grid, wherein the screenshots of video streams are configured to associate with corresponding text strings and to represent the content of the video streams (Qin: Section 0055, lines 5-7: performing image recognition on images from video signals; this means the captured video can be used as a video stream);
receiving a selection of a screenshot in the grid; displaying the corresponding text strings based on the selected screenshot; and playing audio associated with the corresponding text strings (Diamant: Figs. 1, 4, and 5 show screenshots of the users, which can be used for playback).
Claim 20, Qin in view of Diamant discloses further comprising:
capturing a plurality of screenshots of video streams with respective timestamps based on a common clock (Qin: Section 0100, lines 1-7: a picture of a whiteboard can be captured and uploaded at time t, where time t is the timestamp);
generating a plurality of animated video files based on the plurality of screenshots (Qin: Section 0118: a camera providing video of at least one of the users reads on animated video files);
and displaying, on the screen, the plurality of animated videos in a grid, wherein the animated videos are configured to associate with corresponding text strings and to represent the content of the video streams (Diamant: Section 0061: transcriptions include information such as the times of each speech utterance; see the figure reproduced below);
[media_image2.png, greyscale: reproduction of Diamant, Fig. 10]
Fig. 10 of Diamant shows a display of video of the speakers with the transcript of their corresponding text strings, which represents the content of the video streams.
receiving a selection of an animated video file in the grid; playing the selected animated video file on the screen; playing audio associated with the selected animated video file (Qin: Section 0043, lines 3-4: streaming audio and/or video from a camera distributed to a meeting server reads on playing audio and video files);
and displaying, on the screen, the corresponding text strings based on the selected animated video file (Diamant: Section 0061, Fig. 10 shows a conference transcript that includes attributed text and the times of each speech utterance or the position of the speaker of each utterance).
Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Qin (US 2020/0349953) in view of Diamant (US 2019/0341050) as applied to claims 1-3, 5-12, and 14-20 above, and further in view of Gauci (US 2017/0011740).
Claim 4, Qin in view of Diamant discloses speech segments (see Diamant, Section 0053).
Qin in view of Diamant fails to disclose embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams; and
enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams.
Gauci discloses embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams (Section 0017, lines 19-23: during transcription, certain portions of text may be replaced with hyperlinks or references associated with the text, e.g., maps, phone numbers, and web elements); and
enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams (Section 0017: certain portions of the text are replaced with hyperlinks, which therefore reads on the selected hyperlink).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Qin in view of Diamant to include Gauci's teaching of embedding hyperlinks within the transcribed text strings. The motivation is that it makes accessing and playing back the relevant portions of the audio streams more convenient.
Claim 13, Qin in view of Diamant and further in view of Gauci discloses further comprising:
generating speech segments by segmenting the audio streams, wherein the segmenting is based on voice activity detection (Diamant: Section 0053: this section describes a speech recognition machine that translates the signal into text, "Shall we play a game?"; see the screenshot reproduced above);
embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams (Gauci: Section 0017, lines 19-23: during transcription, certain portions of text may be replaced with hyperlinks or references associated with the text, e.g., maps, phone numbers, and web elements); and
enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams (Gauci: Section 0017: certain portions of the text are replaced with hyperlinks, which therefore reads on the selected hyperlink).
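For illustration, hyperlinks of the kind Gauci describes could be rendered into an HTML transcript as below. The "#t=" fragment scheme is an assumption (a page script would seek an audio element to that offset when a link is selected), and utterances are records with timestamp, speaker, and text fields as in the earlier sketch.

import html

def transcript_with_hyperlinks(utterances) -> str:
    """Embed a hyperlink in each text string; the link carries the timestamp
    of the corresponding speech segment so playback can start there."""
    rows = []
    for u in utterances:
        rows.append(
            f'<p><a href="#t={u.timestamp:.2f}" class="seek">'
            f'[{html.escape(u.speaker)}] {html.escape(u.text)}</a></p>'
        )
    return "\n".join(rows)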
Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Trim discloses a conference call system that can determine the speaker fluency level of a particular language (e.g., a first language) using various machine learning techniques contemplated therein (e.g., RNN, LSTM, and CNN). In these embodiments, the conference call system can observe one or more speaker attributes in the voice data of a speaker during the conference call and can further analyze those attributes. Analyzing with various machine learning techniques can allow the conference call system to properly determine whether the observed speaker attributes are indicative of a high fluency level, a low fluency level, or somewhere in between. In embodiments, the conference call system can compare the one or more speaker attributes of the speaker to a historical repository of speaker attributes; this comparison allows the conference call system to properly determine the fluency level associated with known speaker attributes.
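Trim's comparison of observed speaker attributes against a historical repository can be caricatured with a nearest-neighbor lookup. This stand-in deliberately ignores the RNN/LSTM/CNN machinery Trim actually contemplates, and every name here is hypothetical.

import numpy as np

def estimate_fluency(observed: np.ndarray,
                     repository: dict[str, np.ndarray]) -> str:
    """Return the fluency label whose stored attribute vector lies closest
    (in Euclidean distance) to the observed speaker attributes."""
    best_label, best_dist = "unknown", float("inf")
    for label, reference in repository.items():
        dist = float(np.linalg.norm(observed - reference))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label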
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong, whose telephone number is (571) 270-3438. The examiner can normally be reached Monday through Friday, 8:00 am to 4:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner can be reached at . The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AKWASI M SARPONG/ SPE, Art Unit 2681 1/31/2026