Prosecution Insights
Last updated: April 19, 2026
Application No. 18/743,562

METHOD AND SYSTEM FOR CONVERSATION TRANSCRIPTION WITH METADATA

Non-Final OA §103
Filed: Jun 14, 2024
Examiner: SARPONG, AKWASI
Art Unit: 2681
Tech Center: 2600 — Communications
Assignee: SoundHound AI IP LLC
OA Round: 1 (Non-Final)
Grant Probability: 68% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 3y 11m
Grant Probability With Interview: 97%

Examiner Intelligence

Career Allow Rate: 68%, above average (328 granted / 481 resolved; +6.2% vs TC avg)
Interview Lift: +28.9%, strong (resolved cases with interview vs. without)
Typical Timeline: 3y 11m average prosecution
Career History: 491 total applications across all art units; 10 currently pending

Statute-Specific Performance

§101: 10.9% (-29.1% vs TC avg)
§103: 67.1% (+27.1% vs TC avg)
§102: 7.4% (-32.6% vs TC avg)
§112: 11.5% (-28.5% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 481 resolved cases
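The four "vs TC avg" deltas above are mutually consistent with a single Tech Center baseline of roughly 40% per statute. A minimal sketch, assuming that 40% baseline (inferred from the displayed figures, not stated in the source), recomputes the deltas:

```python
# Hypothetical helper: recompute the "vs TC avg" deltas from the examiner's
# statute-specific rates. The 0.40 baseline is an assumption inferred from
# the displayed deltas, not a figure given by the dashboard.
TC_AVG = 0.40  # assumed Tech Center average per statute

examiner_rates = {"101": 0.109, "103": 0.671, "102": 0.074, "112": 0.115}

def delta_vs_tc(rate: float, tc_avg: float = TC_AVG) -> float:
    """Signed difference vs the Tech Center average, in percentage points."""
    return round((rate - tc_avg) * 100, 1)

for statute, rate in examiner_rates.items():
    print(f"§{statute}: {rate:.1%} ({delta_vs_tc(rate):+.1f} pts vs TC avg)")
```

Run against the displayed rates, this reproduces -29.1, +27.1, -32.6, and -28.5, which is why a common ~40% baseline seems likely.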

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-12, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Qin (US 2020/0349953) in view of Diamant (US 2019/0341050).

Claim 1: Qin discloses a computer-implemented method for automatic conversation transcription (Section 0001: conversation and generate speaker transcript), comprising: receiving audio streams from at least one audio source (Section 0031, lines 3-6, "receive audio streams"; Fig. 5, section 510); generating speech based on voice activity detection (Section 0086, lines 12-14, "speech recognition and generate transcript"; Section 0090, lines 6-8, "voice activity detection"); generating a plurality of text strings by transcribing the speech segments with a speech recognition system (Section 0029, lines 3-6, "generating transcript from recognizing speech from users speaking from the meeting"; Section 0051 also discusses generating a transcript of the ad-hoc meeting); determining a plurality of speaker identities associated with the plurality of text strings based on a speaker diarization model (Section 0091, lines 1-3: the speaker diarization module receives input, and a third operation assigns a speaker ID, which results in an assignment of a speaker label); assigning respective indicators to the plurality of text strings based on the plurality of speaker identities, wherein text strings associated with one speaker are assigned the same indicator (Section 0091, lines 14-16: the speaker labels and speaker embeddings recognized for each word of the top SR hypothesis); and generating a transcript by combining the plurality of text strings associated with the respective indicators (Section 0136, lines 4-8: the transcript is generated prior to translating the transcript), wherein the speaker diarization model is configured to utilize one or more diarization factors comprising audio channel data to determine the plurality of speaker identities (Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).

Qin fails to clearly disclose generating voice segments by segmenting the audio streams. Diamant discloses transcription of a conference using a computerized intelligent assistant that generates voice segments by segmenting the audio streams (Section 0053, which discusses a speech recognition machine that translates a signal into the text "Shall we play a game?"; screenshot omitted). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of generating segments based on speech recognition or voice detection. The motivation is that it makes analyzing voice activity effective.

Claim 2: Qin in view of Diamant discloses, when the one or more diarization factors comprise speech feature vector data (Qin: embedding feature in Section 0091, lines 8-9), the computer-implemented method further comprising: determining that the distance between the speech feature vectors of a group of speech segments is below a threshold (Qin: Section 0091, lines 11-14, "cosine similarity, negative Euclidean distance"); and clustering the group of speech segments by assigning the same indicator to the group of speech segments (Qin: Section 0091, lines 8-10, agglomerative clustering).

Claim 3: Qin in view of Diamant discloses wherein the one or more diarization factors comprise acoustic beamforming data (Qin: Section 0095, lines 1-3, acoustic beamforming).

Claim 5: Qin in view of Diamant discloses further comprising timestamping the plurality of text strings according to a common clock and storing the timestamps associated with the text strings (Qin: Section 0100: each utterance is assigned a universal timestamp, an associated speaker, associated text, and an associated audio segment); receiving a request from an editing application to play audio corresponding to a text string; and playing audio beginning at the timestamp corresponding to the requested text string (Qin: Section 0100, lines 3-9: the media in the chat are associated to the transcript inline through a timestamp).

Claim 6: Qin in view of Diamant discloses further comprising displaying, on a screen, the transcript; and automatically scrolling through the plurality of text strings associated with audio streams being played (Qin: Section 0135: the translated transcript is provided to a device, for example as text shown on a display; see also Section 0043).

Claim 7: Qin in view of Diamant discloses further comprising receiving video streams associated with the at least one audio source (Qin: Section 0058, lines 11-13: a meeting-assistance type of device in a conference room, or a video camera having a field of view of a meeting).

Claim 8: Qin in view of Diamant discloses wherein the one or more diarization factors comprise speaker visual data (Qin: Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).

Claim 9: Qin in view of Diamant discloses further comprising displaying, on a screen, video streams accompanying the audio streams; capturing screenshots of video streams accompanying the audio streams (Qin: the video or still images described in Section 0108 read on the captured screenshots); and generating the transcript by combining the plurality of text strings associated with the respective indicators and the screenshots based on the timestamps (Qin: Section 0100, lines 1-7: the transcript is tied inline through a timestamp to the whole meeting).

Claim 10: Qin in view of Diamant discloses further comprising displaying, on the screen, the screenshots of video streams in a grid, wherein the screenshots of video streams are configured to associate with corresponding text strings and to represent the content of the video streams (Qin: Section 0055, lines 5-7: performing image recognition on images from video signals; this means the captured video can be used as a video stream); receiving a selection of a screenshot in the grid; displaying the corresponding text strings based on the selected screenshot; and playing audio associated with the corresponding text strings (Diamant: Figs. 1, 4, and 5 show screenshots of the users, which can be used for playback).

Claim 11: Qin in view of Diamant discloses further comprising: capturing a plurality of screenshots of video streams with respective timestamps based on a common clock (Qin: Section 0100, lines 1-7: a picture of a whiteboard can be captured and uploaded at time t, where time t is the timestamp); generating a plurality of animated video files based on the plurality of screenshots (Qin: Section 0118: a camera providing video of at least one of the users reads on animated video files); displaying, on the screen, the plurality of animated videos in a grid, wherein the animated videos are configured to associate with corresponding text strings and to represent the content of the video streams (Diamant: Section 0061: transcriptions include information such as the times of each speech utterance; Fig. 10 shows a display of video of the speakers with the transcript of their corresponding text strings, which represents the content of the video streams); receiving a selection of an animated video file in the grid; playing the selected animated video file on the screen; playing audio associated with the selected animated video file (Qin: Section 0043, lines 3-4: streaming audio and/or video from a camera distributed to a meeting server reads on playing audio and video files); and displaying, on the screen, the corresponding text strings based on the selected animated video file (Diamant: Section 0061, Fig. 10 shows a conference transcript that includes attributed text and the times of each speech utterance, or the position of the speaker of each utterance).

Claim 12: Qin discloses a computer-implemented method for automatic conversation transcription (Section 0001: conversation and generate speaker transcript), comprising: receiving audio streams from at least one audio source (Section 0031, lines 3-6, "receive audio streams"; Fig. 5, section 510); generating a plurality of text strings by transcribing the audio streams with a speech recognition system (Section 0086, lines 12-14, "speech recognition and generate transcript"; Section 0090, lines 6-8, "voice activity detection"); determining a plurality of speaker identities associated with the plurality of text strings based on a speaker diarization model (Section 0029, lines 3-6, "generating transcript from recognizing speech from users speaking from the meeting"; Section 0051 also discusses generating a transcript of the ad-hoc meeting); assigning respective indicators to the plurality of text strings based on the plurality of speaker identities, wherein text strings associated with one speaker are assigned the same indicator (Section 0091, lines 14-16); generating a transcript by combining the plurality of text strings associated with the respective indicators (Section 0136, lines 4-8), wherein the speaker diarization model is configured to utilize one or more diarization factors comprising audio channel data to determine the plurality of speaker identities (Section 0121). Qin fails to clearly disclose generating voice segments by segmenting the audio streams. Diamant discloses transcription of a conference using a computerized intelligent assistant that generates voice segments by segmenting the audio streams (Section 0053; screenshot omitted). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of generating segments based on speech recognition or voice detection. The motivation is that it makes analyzing voice activity effective.

Claim 14: Qin in view of Diamant discloses further comprising: generating speech segments by segmenting the audio streams, wherein the segmenting is based on voice activity detection (Diamant: Section 0053, which discusses a speech recognition machine that translates a signal into text); timestamping the plurality of text strings according to a common clock; storing the timestamps associated with the text strings; receiving a request from an editing application to play audio corresponding to a text string (Qin: Section 0100: each utterance is assigned a universal timestamp, an associated speaker, associated text, and an associated audio segment); and playing audio beginning at the timestamp corresponding to the requested text string (Qin: Section 0100, lines 3-9: the media in the chat are associated to the transcript inline through a timestamp).

Claim 15: Qin in view of Diamant discloses further comprising displaying the transcript on a screen; and automatically scrolling through the plurality of text strings associated with audio streams being played (Qin: Section 0135: the translated transcript is provided to a device, for example as text shown on a display; see also Section 0043).

Claim 16: Qin in view of Diamant discloses further comprising receiving video streams associated with the at least one audio source (Qin: the video or still images described in Section 0108 read on the captured screenshots).

Claim 17: Qin in view of Diamant discloses wherein the one or more diarization factors comprise speaker visual data (Qin: Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).

Claim 18: Qin in view of Diamant discloses further comprising displaying, on a screen, video streams accompanying the audio streams (Diamant: Fig. 10 shows audio streams); capturing screenshots of video streams accompanying the audio streams (Qin: the video or still images described in Section 0108 read on the captured screenshots); and generating the transcript by combining the plurality of text strings associated with the respective indicators and the screenshots based on the timestamps (Qin: Section 0100, lines 1-7: the transcript is tied inline through a timestamp to the whole meeting).

Claim 19: Qin in view of Diamant discloses further comprising: displaying, on the screen, the screenshots of video streams in a grid, wherein the screenshots of video streams are configured to associate with corresponding text strings and to represent the content of the video streams (Qin: Section 0055, lines 5-7: performing image recognition on images from video signals; this means the captured video can be used as a video stream); receiving a selection of a screenshot in the grid; displaying the corresponding text strings based on the selected screenshot; and playing audio associated with the corresponding text strings (Diamant: Figs. 1, 4, and 5 show screenshots of the users, which can be used for playback).

Claim 20: Qin in view of Diamant discloses further comprising: capturing a plurality of screenshots of video streams with respective timestamps based on a common clock (Qin: Section 0100, lines 1-7: a picture of a whiteboard can be captured and uploaded at time t, where time t is the timestamp); generating a plurality of animated video files based on the plurality of screenshots (Qin: Section 0118: a camera providing video of at least one of the users reads on animated video files); and displaying, on the screen, the plurality of animated videos in a grid, wherein the animated videos are configured to associate with corresponding text strings and to represent the content of the video streams (Diamant: Section 0061: transcriptions include information such as the times of each speech utterance; Fig. 10 shows a display of video of the speakers with the transcript of their corresponding text strings, which represents the content of the video streams); receiving a selection of an animated video file in the grid; playing the selected animated video file on the screen; playing audio associated with the selected animated video file (Qin: Section 0043, lines 3-4: streaming audio and/or video from a camera distributed to a meeting server reads on playing audio and video files); and displaying, on the screen, the corresponding text strings based on the selected animated video file (Diamant: Section 0061, Fig. 10 shows a conference transcript that includes attributed text and the times of each speech utterance, or the position of the speaker of each utterance).

Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Qin (US 2020/0349953) in view of Diamant (US 2019/0341050) as applied to claims 1-3, 5-12, and 14-20 above, and further in view of Gauci (US 2017/0011740).

Claim 4: Qin in view of Diamant discloses further comprising speech segments (see Diamant, Section 0053). Qin in view of Diamant fails to disclose embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams; and enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams. Gauci discloses embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams (Section 0017, lines 19-23: during transcription, certain portions of text may be replaced with hyperlinks or references associated with the text, e.g., maps, phone numbers, and web elements); and enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams (Section 0017: certain portions of the text are replaced with hyperlinks and therefore read on the selected hyperlink). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of generating segments based on speech recognition or voice detection. The motivation is that it makes analyzing voice activity effective.

Claim 13: Qin in view of Diamant, and further in view of Gauci, discloses further comprising: generating speech segments by segmenting the audio streams, wherein the segmenting is based on voice activity detection (Diamant: Section 0053, which discusses a speech recognition machine that translates a signal into text); embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams (Gauci: Section 0017, lines 19-23: during transcription, certain portions of text may be replaced with hyperlinks or references associated with the text, e.g., maps, phone numbers, and web elements); and enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams (Gauci: Section 0017: certain portions of the text are replaced with hyperlinks and therefore read on the selected hyperlink).

Cited Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Trim discloses a conference call system that can determine the speaker fluency level of a particular language (e.g., a first language) using various machine learning techniques contemplated herein (e.g., RNN, LSTM, and CNN). In these embodiments, the conference call system can observe one or more speaker attributes in the voice data of a speaker during the conference call and can further analyze those attributes. Analyzing with various machine learning techniques can allow the conference call system to properly determine whether the observed speaker attributes are indicative of a high fluency level, a low fluency level, or somewhere in between. In embodiments, the conference call system can compare the one or more speaker attributes of the speaker to a historical repository of speaker attributes, which allows it to properly determine the fluency level associated with known speaker attributes.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong, whose telephone number is (571) 270-3438. The examiner can normally be reached Mon-Fri, 8:00am-4:00pm.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner can be reached at . The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AKWASI M SARPONG/
SPE, Art Unit 2681
1/31/2026
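The clustering step the rejection cites for claim 2 (speech segments whose speaker-embedding distance falls below a threshold receive the same speaker indicator) can be sketched as follows. This is an illustrative toy, not code from Qin or the application: the 2-D embeddings, the 0.3 cosine-distance threshold, and the greedy centroid assignment are all invented for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def assign_indicators(embeddings, threshold=0.3):
    """Greedily give segments the same speaker indicator when their
    embedding distance to an existing speaker falls below the threshold."""
    indicators, reps = [], []  # one representative embedding per speaker
    for emb in embeddings:
        for idx, rep in enumerate(reps):
            if cosine_distance(emb, rep) < threshold:
                indicators.append(idx)
                break
        else:  # no close speaker found: start a new indicator
            reps.append(emb)
            indicators.append(len(reps) - 1)
    return indicators

segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.95, 0.15)]
print(assign_indicators(segments))  # → [0, 0, 1, 0]
```

Segments 0, 1, and 3 point in nearly the same direction and share indicator 0; segment 2 is nearly orthogonal and gets its own indicator, mirroring the "same indicator for clustered segments" limitation.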

Prosecution Timeline

Jun 14, 2024
Application Filed
Jan 31, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12475325: MODEL ROBUSTNESS ON OPERATORS AND TRIGGERING KEYWORDS IN NATURAL LANGUAGE TO A MEANING REPRESENTATION LANGUAGE SYSTEM (granted Nov 18, 2025; 2y 5m to grant)
Patent 12444215: METHOD AND SYSTEM FOR DETECTING AND EXTRACTING PRICE REGION FROM DIGITAL FLYERS AND PROMOTIONS (granted Oct 14, 2025; 2y 5m to grant)
Patent 11777874: ARTIFICIAL INTELLIGENCE CONVERSATION ENGINE (granted Oct 03, 2023; 2y 5m to grant)
Patent 11748613: SYSTEMS AND METHODS FOR LARGE SCALE SEMANTIC INDEXING WITH DEEP LEVEL-WISE EXTREME MULTI-LABEL LEARNING (granted Sep 05, 2023; 2y 5m to grant)
Patent 11735190: ATTENTIVE ADVERSARIAL DOMAIN-INVARIANT TRAINING (granted Aug 22, 2023; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 68%
With Interview: 97% (+28.9%)
Median Time to Grant: 3y 11m
PTA Risk: Low
Based on 481 resolved cases by this examiner. Grant probability derived from career allow rate.
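A minimal sketch of how the figures above relate arithmetically: the 68% grant probability matches the career allow rate (328 granted of 481 resolved), and adding the +28.9-point interview lift yields the 97% with-interview figure. The additive model below is an assumption about how the dashboard combines them, not a documented formula.

```python
# Hypothetical reconstruction of the dashboard's projection arithmetic.
def interview_adjusted(base: float, lift_pts: float) -> float:
    """Add the interview lift (in percentage points) to the base grant
    probability, capped at 100%."""
    return min(base + lift_pts / 100.0, 1.0)

base = 328 / 481  # career allow rate: 328 granted of 481 resolved cases
print(f"Grant probability: {base:.0%}")                         # → 68%
print(f"With interview: {interview_adjusted(base, 28.9):.0%}")  # → 97%
```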
