Prosecution Insights
Last updated: April 19, 2026
Application No. 18/743,562

METHOD AND SYSTEM FOR CONVERSATION TRANSCRIPTION WITH METADATA

Non-Final OA §103
Filed: Jun 14, 2024
Examiner: SARPONG, AKWASI
Art Unit: 2681
Tech Center: 2600 — Communications
Assignee: SoundHound AI IP LLC
OA Round: 1 (Non-Final)
Grant Probability: 68% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 3y 11m
Grant Probability With Interview: 97%

Examiner Intelligence

Career Allow Rate: 68%, above average (328 granted / 481 resolved; +6.2% vs TC avg)
Interview Lift: +28.9%, strong (resolved cases with interview vs. without)
Typical Timeline: 3y 11m average prosecution
Career History: 491 total applications across all art units; 10 currently pending

Statute-Specific Performance

§101: 10.9% (-29.1% vs TC avg)
§103: 67.1% (+27.1% vs TC avg)
§102: 7.4% (-32.6% vs TC avg)
§112: 11.5% (-28.5% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 481 resolved cases
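The four "vs TC avg" deltas above are mutually consistent with a single Tech Center baseline of roughly 40% per statute. A minimal sketch, assuming that 40% baseline (inferred from the displayed figures, not stated in the source), recomputes the deltas:

```python
# Hypothetical helper: recompute the "vs TC avg" deltas from the examiner's
# statute-specific rates. The 0.40 baseline is an assumption inferred from
# the displayed deltas, not a figure given by the dashboard.
TC_AVG = 0.40  # assumed Tech Center average per statute

examiner_rates = {"101": 0.109, "103": 0.671, "102": 0.074, "112": 0.115}

def delta_vs_tc(rate: float, tc_avg: float = TC_AVG) -> float:
    """Signed difference vs the Tech Center average, in percentage points."""
    return round((rate - tc_avg) * 100, 1)

for statute, rate in examiner_rates.items():
    print(f"§{statute}: {rate:.1%} ({delta_vs_tc(rate):+.1f} pts vs TC avg)")
```

Run against the displayed rates, this reproduces -29.1, +27.1, -32.6, and -28.5, which is why a common ~40% baseline seems likely.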

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-12, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Qin (US 2020/0349953) in view of Diamant (US 2019/0341050).

Claim 1: Qin discloses a computer-implemented method for automatic conversation transcription (Section 0001: conversation and generate speaker transcript), comprising: receiving audio streams from at least one audio source (Section 0031, lines 3-6, "receive audio streams"; Fig. 5, section 510); generating speech based on voice activity detection (Section 0086, lines 12-14, "speech recognition and generate transcript"; Section 0090, lines 6-8, "voice activity detection"); generating a plurality of text strings by transcribing the speech segments with a speech recognition system (Section 0029, lines 3-6, "generating transcript from recognizing speech from users speaking from the meeting"; Section 0051 also discusses generating a transcript of the ad-hoc meeting); determining a plurality of speaker identities associated with the plurality of text strings based on a speaker diarization model (Section 0091, lines 1-3: the speaker diarization module receives input, and a third operation assigns a speaker ID, which results in an assignment of a speaker label); assigning respective indicators to the plurality of text strings based on the plurality of speaker identities, wherein text strings associated with one speaker are assigned the same indicator (Section 0091, lines 14-16: the speaker labels and speaker embeddings recognized for each word of the top SR hypothesis); and generating a transcript by combining the plurality of text strings associated with the respective indicators (Section 0136, lines 4-8: the transcript is generated prior to translating the transcript), wherein the speaker diarization model is configured to utilize one or more diarization factors comprising audio channel data to determine the plurality of speaker identities (Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).

Qin fails to clearly disclose generating voice segments by segmenting the audio streams. Diamant discloses transcription of a conference using a computerized intelligent assistant that generates voice segments by segmenting the audio streams (Section 0053, which discusses a speech recognition machine that translates a signal into the text "Shall we play a game?"; screenshot omitted). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of generating segments based on speech recognition or voice detection. The motivation is that it makes analyzing voice activity effective.

Claim 2: Qin in view of Diamant discloses, when the one or more diarization factors comprise speech feature vector data (Qin: embedding feature in Section 0091, lines 8-9), the computer-implemented method further comprising: determining that the distance between the speech feature vectors of a group of speech segments is below a threshold (Qin: Section 0091, lines 11-14, "cosine similarity, negative Euclidean distance"); and clustering the group of speech segments by assigning the same indicator to the group of speech segments (Qin: Section 0091, lines 8-10, agglomerative clustering).

Claim 3: Qin in view of Diamant discloses wherein the one or more diarization factors comprise acoustic beamforming data (Qin: Section 0095, lines 1-3, acoustic beamforming).

Claim 5: Qin in view of Diamant discloses further comprising timestamping the plurality of text strings according to a common clock and storing the timestamps associated with the text strings (Qin: Section 0100: each utterance is assigned a universal timestamp, an associated speaker, associated text, and an associated audio segment); receiving a request from an editing application to play audio corresponding to a text string; and playing audio beginning at the timestamp corresponding to the requested text string (Qin: Section 0100, lines 3-9: the media in the chat are associated to the transcript inline through a timestamp).

Claim 6: Qin in view of Diamant discloses further comprising displaying, on a screen, the transcript; and automatically scrolling through the plurality of text strings associated with audio streams being played (Qin: Section 0135: the translated transcript is provided to a device, for example as text shown on a display; see also Section 0043).

Claim 7: Qin in view of Diamant discloses further comprising receiving video streams associated with the at least one audio source (Qin: Section 0058, lines 11-13: a meeting-assistance type of device in a conference room, or a video camera having a field of view of a meeting).

Claim 8: Qin in view of Diamant discloses wherein the one or more diarization factors comprise speaker visual data (Qin: Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).

Claim 9: Qin in view of Diamant discloses further comprising displaying, on a screen, video streams accompanying the audio streams; capturing screenshots of video streams accompanying the audio streams (Qin: the video or still images described in Section 0108 read on the captured screenshots); and generating the transcript by combining the plurality of text strings associated with the respective indicators and the screenshots based on the timestamps (Qin: Section 0100, lines 1-7: the transcript is tied inline through a timestamp to the whole meeting).

Claim 10: Qin in view of Diamant discloses further comprising displaying, on the screen, the screenshots of video streams in a grid, wherein the screenshots of video streams are configured to associate with corresponding text strings and to represent the content of the video streams (Qin: Section 0055, lines 5-7: performing image recognition on images from video signals; this means the captured video can be used as a video stream); receiving a selection of a screenshot in the grid; displaying the corresponding text strings based on the selected screenshot; and playing audio associated with the corresponding text strings (Diamant: Figs. 1, 4, and 5 show screenshots of the users, which can be used for playback).

Claim 11: Qin in view of Diamant discloses further comprising: capturing a plurality of screenshots of video streams with respective timestamps based on a common clock (Qin: Section 0100, lines 1-7: a picture of a whiteboard can be captured and uploaded at time t, where time t is the timestamp); generating a plurality of animated video files based on the plurality of screenshots (Qin: Section 0118: a camera providing video of at least one of the users reads on animated video files); displaying, on the screen, the plurality of animated videos in a grid, wherein the animated videos are configured to associate with corresponding text strings and to represent the content of the video streams (Diamant: Section 0061: transcriptions include information such as the times of each speech utterance; Fig. 10 shows a display of video of the speakers with the transcript of their corresponding text strings, which represents the content of the video streams); receiving a selection of an animated video file in the grid; playing the selected animated video file on the screen; playing audio associated with the selected animated video file (Qin: Section 0043, lines 3-4: streaming audio and/or video from a camera distributed to a meeting server reads on playing audio and video files); and displaying, on the screen, the corresponding text strings based on the selected animated video file (Diamant: Section 0061, Fig. 10 shows a conference transcript that includes attributed text and the times of each speech utterance, or the position of the speaker of each utterance).

Claim 12: Qin discloses a computer-implemented method for automatic conversation transcription (Section 0001: conversation and generate speaker transcript), comprising: receiving audio streams from at least one audio source (Section 0031, lines 3-6, "receive audio streams"; Fig. 5, section 510); generating a plurality of text strings by transcribing the audio streams with a speech recognition system (Section 0086, lines 12-14, "speech recognition and generate transcript"; Section 0090, lines 6-8, "voice activity detection"); determining a plurality of speaker identities associated with the plurality of text strings based on a speaker diarization model (Section 0029, lines 3-6, "generating transcript from recognizing speech from users speaking from the meeting"; Section 0051 also discusses generating a transcript of the ad-hoc meeting); assigning respective indicators to the plurality of text strings based on the plurality of speaker identities, wherein text strings associated with one speaker are assigned the same indicator (Section 0091, lines 14-16); generating a transcript by combining the plurality of text strings associated with the respective indicators (Section 0136, lines 4-8), wherein the speaker diarization model is configured to utilize one or more diarization factors comprising audio channel data to determine the plurality of speaker identities (Section 0121). Qin fails to clearly disclose generating voice segments by segmenting the audio streams. Diamant discloses transcription of a conference using a computerized intelligent assistant that generates voice segments by segmenting the audio streams (Section 0053; screenshot omitted). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of generating segments based on speech recognition or voice detection. The motivation is that it makes analyzing voice activity effective.

Claim 14: Qin in view of Diamant discloses further comprising: generating speech segments by segmenting the audio streams, wherein the segmenting is based on voice activity detection (Diamant: Section 0053, which discusses a speech recognition machine that translates a signal into text); timestamping the plurality of text strings according to a common clock; storing the timestamps associated with the text strings; receiving a request from an editing application to play audio corresponding to a text string (Qin: Section 0100: each utterance is assigned a universal timestamp, an associated speaker, associated text, and an associated audio segment); and playing audio beginning at the timestamp corresponding to the requested text string (Qin: Section 0100, lines 3-9: the media in the chat are associated to the transcript inline through a timestamp).

Claim 15: Qin in view of Diamant discloses further comprising displaying the transcript on a screen; and automatically scrolling through the plurality of text strings associated with audio streams being played (Qin: Section 0135: the translated transcript is provided to a device, for example as text shown on a display; see also Section 0043).

Claim 16: Qin in view of Diamant discloses further comprising receiving video streams associated with the at least one audio source (Qin: the video or still images described in Section 0108 read on the captured screenshots).

Claim 17: Qin in view of Diamant discloses wherein the one or more diarization factors comprise speaker visual data (Qin: Section 0121: audio and video channels are used to attribute speech to users for creating a diarized transcript).

Claim 18: Qin in view of Diamant discloses further comprising displaying, on a screen, video streams accompanying the audio streams (Diamant: Fig. 10 shows audio streams); capturing screenshots of video streams accompanying the audio streams (Qin: the video or still images described in Section 0108 read on the captured screenshots); and generating the transcript by combining the plurality of text strings associated with the respective indicators and the screenshots based on the timestamps (Qin: Section 0100, lines 1-7: the transcript is tied inline through a timestamp to the whole meeting).

Claim 19: Qin in view of Diamant discloses further comprising: displaying, on the screen, the screenshots of video streams in a grid, wherein the screenshots of video streams are configured to associate with corresponding text strings and to represent the content of the video streams (Qin: Section 0055, lines 5-7: performing image recognition on images from video signals; this means the captured video can be used as a video stream); receiving a selection of a screenshot in the grid; displaying the corresponding text strings based on the selected screenshot; and playing audio associated with the corresponding text strings (Diamant: Figs. 1, 4, and 5 show screenshots of the users, which can be used for playback).

Claim 20: Qin in view of Diamant discloses further comprising: capturing a plurality of screenshots of video streams with respective timestamps based on a common clock (Qin: Section 0100, lines 1-7: a picture of a whiteboard can be captured and uploaded at time t, where time t is the timestamp); generating a plurality of animated video files based on the plurality of screenshots (Qin: Section 0118: a camera providing video of at least one of the users reads on animated video files); and displaying, on the screen, the plurality of animated videos in a grid, wherein the animated videos are configured to associate with corresponding text strings and to represent the content of the video streams (Diamant: Section 0061: transcriptions include information such as the times of each speech utterance; Fig. 10 shows a display of video of the speakers with the transcript of their corresponding text strings, which represents the content of the video streams); receiving a selection of an animated video file in the grid; playing the selected animated video file on the screen; playing audio associated with the selected animated video file (Qin: Section 0043, lines 3-4: streaming audio and/or video from a camera distributed to a meeting server reads on playing audio and video files); and displaying, on the screen, the corresponding text strings based on the selected animated video file (Diamant: Section 0061, Fig. 10 shows a conference transcript that includes attributed text and the times of each speech utterance, or the position of the speaker of each utterance).

Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Qin (US 2020/0349953) in view of Diamant (US 2019/0341050) as applied to claims 1-3, 5-12, and 14-20 above, and further in view of Gauci (US 2017/0011740).

Claim 4: Qin in view of Diamant discloses further comprising speech segments (see Diamant, Section 0053). Qin in view of Diamant fails to disclose embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams; and enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams. Gauci discloses embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams (Section 0017, lines 19-23: during transcription, certain portions of text may be replaced with hyperlinks or references associated with the text, e.g., maps, phone numbers, and web elements); and enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams (Section 0017: certain portions of the text are replaced with hyperlinks and therefore read on the selected hyperlink). Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of generating segments based on speech recognition or voice detection. The motivation is that it makes analyzing voice activity effective.

Claim 13: Qin in view of Diamant, and further in view of Gauci, discloses further comprising: generating speech segments by segmenting the audio streams, wherein the segmenting is based on voice activity detection (Diamant: Section 0053, which discusses a speech recognition machine that translates a signal into text); embedding hyperlinks within the plurality of text strings, wherein the hyperlinks are associated with corresponding speech segments of the audio streams (Gauci: Section 0017, lines 19-23: during transcription, certain portions of text may be replaced with hyperlinks or references associated with the text, e.g., maps, phone numbers, and web elements); and enabling, by receiving a selected hyperlink associated with a speech segment, a playback of relevant audio streams (Gauci: Section 0017: certain portions of the text are replaced with hyperlinks and therefore read on the selected hyperlink).

Cited Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Trim discloses a conference call system that can determine the speaker fluency level of a particular language (e.g., a first language) using various machine learning techniques contemplated herein (e.g., RNN, LSTM, and CNN). In these embodiments, the conference call system can observe one or more speaker attributes in the voice data of a speaker during the conference call and can further analyze those attributes. Analyzing with various machine learning techniques can allow the conference call system to properly determine whether the observed speaker attributes are indicative of a high fluency level, a low fluency level, or somewhere in between. In embodiments, the conference call system can compare the one or more speaker attributes of the speaker to a historical repository of speaker attributes, which allows it to properly determine the fluency level associated with known speaker attributes.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong, whose telephone number is (571) 270-3438. The examiner can normally be reached Mon-Fri, 8:00am-4:00pm.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner can be reached at . The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AKWASI M SARPONG/
SPE, Art Unit 2681
1/31/2026
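The clustering step the rejection cites for claim 2 (speech segments whose speaker-embedding distance falls below a threshold receive the same speaker indicator) can be sketched as follows. This is an illustrative toy, not code from Qin or the application: the 2-D embeddings, the 0.3 cosine-distance threshold, and the greedy centroid assignment are all invented for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def assign_indicators(embeddings, threshold=0.3):
    """Greedily give segments the same speaker indicator when their
    embedding distance to an existing speaker falls below the threshold."""
    indicators, reps = [], []  # one representative embedding per speaker
    for emb in embeddings:
        for idx, rep in enumerate(reps):
            if cosine_distance(emb, rep) < threshold:
                indicators.append(idx)
                break
        else:  # no close speaker found: start a new indicator
            reps.append(emb)
            indicators.append(len(reps) - 1)
    return indicators

segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.95, 0.15)]
print(assign_indicators(segments))  # → [0, 0, 1, 0]
```

Segments 0, 1, and 3 point in nearly the same direction and share indicator 0; segment 2 is nearly orthogonal and gets its own indicator, mirroring the "same indicator for clustered segments" limitation.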

Prosecution Timeline

Jun 14, 2024
Application Filed
Jan 31, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12475325: MODEL ROBUSTNESS ON OPERATORS AND TRIGGERING KEYWORDS IN NATURAL LANGUAGE TO A MEANING REPRESENTATION LANGUAGE SYSTEM (granted Nov 18, 2025; 2y 5m to grant)
Patent 12444215: METHOD AND SYSTEM FOR DETECTING AND EXTRACTING PRICE REGION FROM DIGITAL FLYERS AND PROMOTIONS (granted Oct 14, 2025; 2y 5m to grant)
Patent 11777874: ARTIFICIAL INTELLIGENCE CONVERSATION ENGINE (granted Oct 03, 2023; 2y 5m to grant)
Patent 11748613: SYSTEMS AND METHODS FOR LARGE SCALE SEMANTIC INDEXING WITH DEEP LEVEL-WISE EXTREME MULTI-LABEL LEARNING (granted Sep 05, 2023; 2y 5m to grant)
Patent 11735190: ATTENTIVE ADVERSARIAL DOMAIN-INVARIANT TRAINING (granted Aug 22, 2023; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 68%
With Interview: 97% (+28.9%)
Median Time to Grant: 3y 11m
PTA Risk: Low
Based on 481 resolved cases by this examiner. Grant probability derived from career allow rate.
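A minimal sketch of how the figures above relate arithmetically: the 68% grant probability matches the career allow rate (328 granted of 481 resolved), and adding the +28.9-point interview lift yields the 97% with-interview figure. The additive model below is an assumption about how the dashboard combines them, not a documented formula.

```python
# Hypothetical reconstruction of the dashboard's projection arithmetic.
def interview_adjusted(base: float, lift_pts: float) -> float:
    """Add the interview lift (in percentage points) to the base grant
    probability, capped at 100%."""
    return min(base + lift_pts / 100.0, 1.0)

base = 328 / 481  # career allow rate: 328 granted of 481 resolved cases
print(f"Grant probability: {base:.0%}")                         # → 68%
print(f"With interview: {interview_adjusted(base, 28.9):.0%}")  # → 97%
```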
