Last updated: May 29, 2026
Application No. 18/711,020
CONFERENCE CONTENT DISPLAY METHOD, CONFERENCE SYSTEM AND CONFERENCE DEVICE

Final Rejection §103
Filed: May 16, 2024
Priority: Nov 19, 2021 (nonprovisional of PCTCN2021131943)
Examiner: THOMAS-HOMESCU, ANNE L
Art Unit: 2656
Tech Center: 2600 (Communications)
Assignee: BOE TECHNOLOGY GROUP CO., LTD.
OA Round: 2 (Final)
Interview Optional

Examiner has a relatively high allowance rate (77%) and a +35.8% interview lift. A written response may suffice.
Based on 369 resolved cases, 2023–2026
Examiner Intelligence

THOMAS-HOMESCU, ANNE L
Career Allowance Rate: 77% (above average), 284 granted / 369 resolved, +15.0% vs TC avg
Interview Lift: +35.8% (resolved cases with interview vs. without)
Avg Prosecution: 2y 7m (typical timeline), 18 currently pending
Total Applications: 395 across all art units (career history)
Statute-Specific Performance

§101: 4.8% (-35.2% vs TC avg)
§103: 89.1% (+49.1% vs TC avg)
§102: 4.9% (-35.1% vs TC avg)
§112: 0.7% (-39.3% vs TC avg)
Based on career data from 369 resolved cases
Office Action

§103
DETAILED ACTION

This communication is in response to the Amendments and Arguments filed on 24 February 2026. Claims 33-35, 37-47, and 49-52 are pending and have been examined. The Applicants’ amendment and remarks have been carefully considered, but are not persuasive. Hence, this Action has been made FINAL. 
All previous objections and rejections directed to the Applicant’s disclosure and claims not discussed in this Office Action have been withdrawn by the Examiner.

Response to Amendments and Arguments

The Applicant’s arguments have been considered but are moot because the new ground of rejection (Marti et al.) does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 33-34, 38, 41-42, 44-45, 50, and 52 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20130144603, hereinafter referred to as Lord et al., in view of US 20220246168, hereinafter referred to as Marti et al.

Claims 1-32 cancelled. 

Regarding claim 33 (Currently Amended), Lord et al. discloses a method for displaying a conference content, applied to a conference device (“The conferencing device 120a includes a display 121 that is configured to present text and/or graphics,” Lord et al., para [0021].), wherein the method comprises:

determining a voice text corresponding to voice information collected by a terminal of a participating user (“The speech recognizer 212 transforms speech audio data received (e.g., from the conferencing device 120) into textual representation of an utterance represented by the speech audio data,” Lord et al., para [0053]. And, “At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178].); and

displaying the conference content related to the voice text (“Informing the user of the conference history information may also or instead include visually presenting the information, such as via the display 121 of the conferencing device 120a. In the illustrated example, the AEFS 100 causes a message 112 that includes a portion of a transcript of the voice conference to be displayed on the display 121,” Lord et al., para [0032].).

Lord et al., though, does not disclose wherein, in a case where only one of multiple participating users speaks, multiple pieces of voice information collected by multiple user terminals corresponding to the multiple participating users are voice content of the one participating user, and voice information with a largest volume among the multiple pieces of voice information is taken as voice information whose volume satisfies a condition;

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information;

in a case where two or more participating users speak simultaneously, for each of the two or more participating users, among multiple pieces of voice information collected by a user terminal corresponding to the participating user, voice information with a largest volume is taken as voice information whose volume satisfies a condition; and

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information.

Marti et al. is cited to disclose wherein, in a case where only one of multiple participating users speaks, multiple pieces of voice information collected by multiple user terminals corresponding to the multiple participating users are voice content of the one participating user (“For example, I/O device(s) 108 could include a directional microphone array that forms a steerable beam directed towards one or more participants of media content stream 150 (e.g., speaker 180 or the like). In such instances, I/O device(s) 108 could acquire a speech portion from one or more participants of media content stream 150 (e.g., speaker 180 or the like) without acquiring speech from other participants (e.g., user 160 or the like),” Marti et al., para [0030]. The I/O devices are multiple user terminals corresponding to multiple participating users. And, “In step 202, audio detection module 130 determines what is said by one or more one or more participants of media content stream 150 (e.g., speaker 180 or the like). In some embodiments, audio detection module 130 detects an audio signal associated with one or more participants of media content stream 150 (e.g., speaker 180 or the like),” Marti et al., para [0060].), and voice information with a largest volume among the multiple pieces of voice information is taken as voice information whose volume satisfies a condition (“In some embodiments, audio detection module 130 uses voice isolation module 131 to detect a dominant voice (e.g., loudest voice) associated with one or more participants of media content stream 150 (e.g., speaker 180 or the like), and cancels out background noise and other less dominant voices. In some embodiments, audio detection module 130 uses conversation finder module 132 to determine whether one or more voices and/or conversations in the environment surrounding or proximate to user 160, speaker 180, or the like are associated with media content stream 150,” Marti et al., para [0061]. 
Here, the volume condition satisfied is being the loudest voice.);

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information (“Voice transcription module 133 uses one or more voice transcription techniques to generate a textual representation of each word contained in an audio signal associated media content stream 150. In some embodiments, voice transcription module 133 changes the audio signal received from I/O device(s) 108 to a different modality of content (e.g., speech to text),” Marti et al., para [0032]. Here, the dominant/loudest voice is isolated according to voice isolation module 131 and then transcribed using voice transcription module 133.); and

in a case where two or more participating users speak simultaneously, for each of the two or more participating users, among multiple pieces of voice information collected by a user terminal corresponding to the participating user, voice information with a largest volume is taken as voice information whose volume satisfies a condition (Marti et al., para [0061].); and
wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information (Marti et al., para [0032].). Marti et al. benefits Lord et al. by allowing participants to determine the explanation of domain-specific terminology while keeping track of a media content stream or media content stream being consumed, thereby providing participants with a practical way of performing the lookup of such terminology while also speaking or engaging in a real-time manner with the ongoing media content stream or while consuming the media content stream without interrupting the content (Marti et al., para [0005]-[0006]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Marti et al. to enhance the voice conferencing system of Lord et al.
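For illustration only, the volume-based selection recited in the two limitations above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the application, Lord et al., or Marti et al.: volume is assumed to be measured as root-mean-square amplitude, the function names (rms_volume, loudest_terminal, loudest_piece_per_terminal) are hypothetical, and the separation of a terminal's audio into individual pieces is assumed to have happened in an earlier step.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude of one piece of audio; 0.0 if it is empty."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_terminal(capture_by_terminal):
    """Single-speaker case: several terminals capture the same speaker.

    `capture_by_terminal` maps a terminal id to its list of samples.
    Returns the id of the terminal whose capture has the largest volume.
    """
    return max(capture_by_terminal,
               key=lambda t: rms_volume(capture_by_terminal[t]))


def loudest_piece_per_terminal(pieces_by_terminal):
    """Simultaneous-speech case: each terminal holds several pieces of audio.

    `pieces_by_terminal` maps a terminal id to a list of pieces (each a list
    of samples). Returns, per terminal, the index of its loudest piece.
    Terminals with no pieces are skipped.
    """
    return {
        terminal: max(range(len(pieces)), key=lambda i: rms_volume(pieces[i]))
        for terminal, pieces in pieces_by_terminal.items()
        if pieces
    }
```

Under this reading, only the selected capture (or the selected piece per terminal) would be passed on to speech recognition to produce the voice text.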

Regarding claim 34 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 33, wherein the determining of the voice text corresponding to the voice information collected by the terminal of the participating user, comprises: 

receiving the voice information collected by the terminal (“The AEFS 100 receives data representing a speech signal that represents the utterance 110, such as by receiving a digital representation of an audio signal transmitted by conferencing device 120b. The data representing the speech signal may include audio samples (e.g., raw audio data), compressed audio data, speech vectors (e.g., mel frequency cepstral coefficients), and/or any other data that may be used to represent an audio signal ,” Lord et al., para [0024].), performing voice recognition on the voice information (“The speech recognizer 212 transforms speech audio data received (e.g., from the conferencing device 120) into textual representation of an utterance represented by the speech audio data,” Lord et al., para [0053].), and determining the voice text corresponding to the voice information (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178].).  

Regarding claim 38 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 35, wherein the receiving of the voice information collected by the terminal, comprises: 

establishing a communication connection with the terminal (“Each of the illustrated conferencing devices 120 includes or may be communicatively coupled to a microphone operable to receive a speech signal from a speaker,” Lord et al., para [0038]. And, “In some embodiments, the microphone may be a microphone of a conferencing device operated by a speaker,” Lord et al., para [0256]. Thus, the microphone collecting the voice information is part of the user terminal.), and receiving the voice information collected by the terminal through streaming transmission (Lord et al., para [0038].).

Regarding claim 39 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 33, wherein the voice text further comprises user information (Lord et al., fig. 1C.); 

wherein the user information is determined according to a voiceprint feature corresponding to the voice information; and the voiceprint feature is obtained by performing voiceprint recognition on the voice information (“The speaker recognizer 214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the conferencing device 120. The speaker recognizer 214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data,” Lord et al., para [0054].); 

wherein after determining the voice text corresponding to the voice information collected by the terminal of the participating user, the method further comprises: 

generating a conference record according to the voice text (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178]. The transcript is a conference record.); or, 

generating a conference record according to the voice text and user information corresponding to the voice text (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178]. Here, the speaker identifier is the user information corresponding to the voice text (i.e., speaker’s portion of transcript).).  
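As a hedged illustration of the voiceprint limitation discussed for this claim, the sketch below matches a voiceprint against stored voice prints and attaches the matched name to a transcript entry. It is not the implementation of Lord et al. or of the application: voiceprints are assumed to be fixed-length numeric feature vectors, cosine similarity and the 0.8 acceptance score are assumptions chosen for the example, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(voiceprint, enrolled, min_score=0.8):
    """Return the enrolled name whose stored voiceprint matches best.

    `enrolled` maps a user name to a previously stored voiceprint vector.
    Returns None when no stored voiceprint reaches `min_score` (assumed value).
    """
    best_name, best_score = None, min_score
    for name, reference in enrolled.items():
        score = cosine_similarity(voiceprint, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def add_record_entry(record, voice_text, voiceprint, enrolled):
    """Append one transcript entry (speaker name plus voice text) to the record."""
    name = identify_speaker(voiceprint, enrolled) or "Unknown speaker"
    record.append({"speaker": name, "text": voice_text})
    return record
```

Each entry pairs user information with the voice text, which corresponds to the second alternative of the claim (a conference record generated from the voice text and the corresponding user information).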

Regarding claim 41 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 39, wherein after generating the conference record, the method further comprises: 

obtaining a voice file uploaded locally, and determining a supplementary voice text and a supplementary voiceprint feature corresponding to uploaded voice information in the voice file (“At block 3.2204, the process performs comparing properties of the speech signal with properties of previously recorded speech signals from multiple persons. In some embodiments, the process accesses voice prints associated with multiple persons, and determines a best match against the speech signal,” Lord et al., para [0112].); 

generating a supplementary conference record according to the supplementary voice text and supplementary user information corresponding to the supplementary voiceprint feature (“At block 3.2004, the process performs performing voice identification based on the data representing the speech signals from the voice conference. In some embodiments, voice identification may include generating a voice print, voice model, or other biometric feature set that characterizes the voice of the speaker, and then comparing the generated voice print to previously generated voice prints,” Lord et al., para [0108]. Here, the voice conference is supplemented by the voiceprint feature.); and 

updating the conference record by using the supplementary conference record (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178].). 

Regarding claim 42 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 33, wherein after determining the voice text corresponding to the voice information collected by the terminal of the participating user, the method further comprises: 

directly translating the voice text into a translated text corresponding to a preset language type (“The AEFS 100 may perform other services, including translating utterances made by speakers in the voice conference, so that a multi-lingual voice conference may be conducted even between participants who do not understand all of the languages being spoken…Determining the language may also or instead be based on contextual factors, such as GPS information indicating that the current speaker is in Germany, Austria, or some other region where German is commonly spoken,” Lord et al., para [0034]. Determining a language in advance of translation using location is a way to preset the language type.)

Regarding claim 44 (Currently Amended), Lord et al. discloses a conference system, comprising a user terminal and a conference device (Lord et al., fig. 1A.), wherein: 

the user terminal is configured to collect voice information (“The speech recognizer 212 transforms speech audio data received (e.g., from the conferencing device 120) into textual representation of an utterance represented by the speech audio data,” Lord et al., para [0053].); and 

the conference device is configured to: determine a voice text corresponding to the voice information collected by the user terminal (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178].), and display a conference content related to the voice text (“Informing the user of the conference history information may also or instead include visually presenting the information, such as via the display 121 of the conferencing device 120a. In the illustrated example, the AEFS 100 causes a message 112 that includes a portion of a transcript of the voice conference to be displayed on the display 121,” Lord et al., para [0032].).  

Lord et al., though, does not disclose wherein, in a case where only one of multiple participating users speaks, multiple pieces of voice information collected by multiple user terminals corresponding to the multiple participating users are voice content of the one participating user, and voice information with a largest volume among the multiple pieces of voice information is taken as voice information whose volume satisfies a condition;

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information;

in a case where two or more participating users speak simultaneously, for each of the two or more participating users, among multiple pieces of voice information collected by a user terminal corresponding to the participating user, voice information with a largest volume is taken as voice information whose volume satisfies a condition; and

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information.

Marti et al. is cited to disclose wherein, in a case where only one of multiple participating users speaks, multiple pieces of voice information collected by multiple user terminals corresponding to the multiple participating users are voice content of the one participating user (“For example, I/O device(s) 108 could include a directional microphone array that forms a steerable beam directed towards one or more participants of media content stream 150 (e.g., speaker 180 or the like). In such instances, I/O device(s) 108 could acquire a speech portion from one or more participants of media content stream 150 (e.g., speaker 180 or the like) without acquiring speech from other participants (e.g., user 160 or the like),” Marti et al., para [0030]. The I/O devices are multiple user terminals corresponding to multiple participating users. And, “In step 202, audio detection module 130 determines what is said by one or more one or more participants of media content stream 150 (e.g., speaker 180 or the like). In some embodiments, audio detection module 130 detects an audio signal associated with one or more participants of media content stream 150 (e.g., speaker 180 or the like),” Marti et al., para [0060].), and voice information with a largest volume among the multiple pieces of voice information is taken as voice information whose volume satisfies a condition (“In some embodiments, audio detection module 130 uses voice isolation module 131 to detect a dominant voice (e.g., loudest voice) associated with one or more participants of media content stream 150 (e.g., speaker 180 or the like), and cancels out background noise and other less dominant voices. In some embodiments, audio detection module 130 uses conversation finder module 132 to determine whether one or more voices and/or conversations in the environment surrounding or proximate to user 160, speaker 180, or the like are associated with media content stream 150,” Marti et al., para [0061]. 
Here, the volume condition satisfied is being the loudest voice.);

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information (“Voice transcription module 133 uses one or more voice transcription techniques to generate a textual representation of each word contained in an audio signal associated media content stream 150. In some embodiments, voice transcription module 133 changes the audio signal received from I/O device(s) 108 to a different modality of content (e.g., speech to text),” Marti et al., para [0032]. Here, the dominant/loudest voice is isolated according to voice isolation module 131 and then transcribed using voice transcription module 133.); and

in a case where two or more participating users speak simultaneously, for each of the two or more participating users, among multiple pieces of voice information collected by a user terminal corresponding to the participating user, voice information with a largest volume is taken as voice information whose volume satisfies a condition (Marti et al., para [0061].); and

wherein the voice text is determined according to the voice information whose volume satisfies the condition among the multiple pieces of voice information (Marti et al., para [0032].). Marti et al. benefits Lord et al. by allowing participants to determine the explanation of domain-specific terminology while keeping track of a media content stream or media content stream being consumed, thereby providing participants with a practical way of performing the lookup of such terminology while also speaking or engaging in a real-time manner with the ongoing media content stream or while consuming the media content stream without interrupting the content (Marti et al., para [0005]-[0006]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Marti et al. to enhance the voice conferencing system of Lord et al.

Regarding claim 45 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 44, wherein, the user terminal is configured to send the collected voice information to the conference device (“The AEFS 100 and the conferencing devices 120 are communicatively coupled to one another via the communication system 150,” Lord et al., para [0022]. Here, conferencing devices 120 are interpreted as user terminals and the AEFS 100 and conferencing devices 120 are interpreted as conference devices.); and 

the conference device is configured to perform voice recognition on the voice information to obtain the voice text (“The natural language processor 216 processes text generated by the speech recognizer 212 and/or located in information items obtained from the speaker-related information sources 130. In doing so, the natural language processor 216 may identify relationships, events, or entities (e.g., people, places, things) that may facilitate speaker identification, language translation, and/or other functions of the AEFS 100,” Lord et al., para [0055].).  

Regarding claim 50 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 45, wherein the conference device is configured to: establish a communication connection with the user terminal (“Each of the illustrated conferencing devices 120 includes or may be communicatively coupled to a microphone operable to receive a speech signal from a speaker,” Lord et al., para [0038]. And, “In some embodiments, the microphone may be a microphone of a conferencing device operated by a speaker,” Lord et al., para [0256]. Thus, the microphone collecting the voice information is part of the user terminal.), and receive the voice information collected by the user terminal through streaming transmission (Lord et al., para [0038].).

Regarding claim 52 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 44, wherein, 

the conference device is configured to translate the voice text into a translated text corresponding to a preset language type (“The AEFS 100 may perform other services, including translating utterances made by speakers in the voice conference, so that a multi-lingual voice conference may be conducted even between participants who do not understand all of the languages being spoken…Determining the language may also or instead be based on contextual factors, such as GPS information indicating that the current speaker is in Germany, Austria, or some other region where German is commonly spoken,” Lord et al., para [0034]. Determining a language in advance of translation using location is a way to preset the language type.); or, 

wherein the conference device is further configured to display the conference content related to the voice text through at least one of: 

displaying the voice text in real time (“The speech and language engine 210 includes a speech recognizer 212, a speaker recognizer 214, a natural language processor 216, and a language translation processor 218. The speech recognizer 212 transforms speech audio data received (e.g., from the conferencing device 120) into textual representation of an utterance represented by the speech audio data,” Lord et al., para [0053].); 

displaying a user name corresponding to the voice text in real time (“The conferencing device 120a includes a display 121 that is configured to present text and/or graphics,” Lord et al., para [0022]. And, “For example, in response to the onset of an utterance of a speaker, the AEFS 100 may display the name of the speaker on the display 121, so that the user is always informed as to who is speaking,” Lord et al., para [0033].);

displaying a conference record related to the voice text (Lord et al., fig. 1C - transcript is a type of conference record.); 

displaying a translated text of a preset language type translated from the voice text in real time (“Having determined speaker-related information, the AEFS 100 may then translate an utterance in a first language into an utterance in a second language. In some embodiments, the AEFS 100 translates an utterance by first performing speech recognition to translate the utterance into a textual representation that includes a sequence of words in the first language. Then, the AEFS 100 may translate the text in the first language into a message in a second language, using machine translation techniques,” Lord et al., para [0035]. And, Lord et al., para [0034]. Determining a language in advance of translation using location is a way to preset the language type.); 

displaying a download link address corresponding to a conference record related to the voice text (“Note also that a link 147 is displayed that associates the term "agenda" in the transcript 141 with the agenda 142. As further examples, the terms "calendar" and "John's email" are respectively linked to the calendar 143 and the email 144,” Lord et al., para [0049]. The conference agenda is interpreted as conference record.).

Claim(s) 35-36 and 47-48 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20130144603, hereinafter referred to as Lord et al., in view of US 20220246168, hereinafter referred to as Marti et al., and further in view of US 8521525, hereinafter referred to as Fujino. 

Regarding claim 35 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 33, but not wherein the determining of the voice text corresponding to the voice information collected by the terminal of the participating user, comprises: 

receiving a voice text, and determining the received voice text as the voice text corresponding to the voice information; 

receiving the voice text sent by the terminal; 

the voice text sent by the terminal is obtained by: the terminal performing voice recognition on the voice information.  

Fujino is cited to disclose wherein the determining of the voice text corresponding to the voice information collected by the terminal of the participating user, comprises: 

receiving a voice text, and determining the received voice text as the voice text corresponding to the voice information (“When sound data in the predetermined frequency domain is detected from the acquired sound data, namely, when it is determined that the speaker's speech is detected (S11:YES), a volume level of the acquired sound data is measured (S12). That is to say, a volume level of the speaker's voice acquired from the microphone 140 is measured. Thereafter, it is determined whether the volume level is smaller than a threshold (S13). The threshold is a value that is set in a threshold storage area (not shown) of the HDD 104 in advance. As one example, the threshold is 30 dB. When the volume level is equal to or greater than 30 dB as the threshold (S13:NO), the process goes to S11. The process for detecting the speaker's speech is executed again. When the volume level is smaller than 30 dB as the threshold (S13:YES), conversion from the speaker's speech into text data is carried out through the sound recognition based on the sound data (S14),” Fujino, col. 5, lines 40-56. Thus, the speech is converted to text when it is below a threshold (i.e., a type of condition). See also Fujino, fig. 5. And, “When the conversion of sound data into text data through the sound recognition is executed (S14), it is determined whether the conversion of the sound data into text data succeeded (S15). Concretely, the sound data is converted into text data, and it is determined whether the converted text data is stored in the predetermined storage area as the recognition result,” Fujino, col. 6, lines 15-21. 
And, “When the recognition result is stored in the predetermined storage area, namely, when it is determined that the conversion of the sound data into text data through the sound recognition succeeded (S15:YES), the text data as the recognition result stored in the predetermined storage area of the RAM 103 at S14 is transmitted to the other communication control apparatus 100 (S16),” Fujino, col. 6, lines 27-33. The communication control apparatus 100 is part of the conference device.); 

wherein the receiving of the voice text comprises: receiving the voice text sent by a server; or, 
receiving the voice text sent by the terminal (Fujino, col. 6, lines 27-33.); 

the voice text sent by the terminal is obtained by: the terminal sending the voice information to the server for voice recognition and receiving the voice text sent by the server; or,
the voice text sent by the terminal is obtained by: the terminal performing voice recognition on the voice information (Fujino, col. 6, lines 15-21. And, Fujino, col. 6, lines 27-33.). Fujino benefits Lord et al. by creating text data based on the input sound data and displaying them in chronological order, thereby allowing the contents spoken in the past to be visually referred to, and closer communication made possible (Fujino, Background). Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Fujino to provide the conference system with smooth and efficient performance. 
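The Fujino passages quoted for this claim (steps S11 through S16) describe a terminal-side flow that can be sketched as follows. This is an illustrative sketch only, not Fujino's code: the function name process_frame, the callback parameters, and the returned status labels are hypothetical. The 30 dB value is the example threshold given in the quoted text, and what Fujino does after a failed recognition (S15:NO) is not stated in the quoted passages, so the sketch simply reports the failure.

```python
from typing import Callable, Optional

THRESHOLD_DB = 30.0  # example threshold value from the quoted Fujino passage


def process_frame(speech_detected: bool,
                  volume_db: float,
                  recognize: Callable[[], Optional[str]],
                  transmit: Callable[[str], None],
                  threshold_db: float = THRESHOLD_DB) -> str:
    """Run one pass of the quoted S11-S16 flow and return a status label."""
    # S11: continue only when the speaker's speech is detected.
    if not speech_detected:
        return "no_speech"
    # S12/S13: the measured volume must be smaller than the threshold;
    # otherwise the flow goes back to speech detection.
    if volume_db >= threshold_db:
        return "volume_not_below_threshold"
    # S14: convert the speech into text (the recognizer is a hypothetical callback).
    text = recognize()
    # S15: check whether the conversion produced a recognition result.
    if not text:
        return "recognition_failed"
    # S16: transmit the text data to the other apparatus.
    transmit(text)
    return "transmitted"
```

In this reading the terminal performs the recognition itself and sends only text, which is the alternative the rejection relies on (the voice text sent by the terminal is obtained by the terminal performing voice recognition on the voice information).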

Claim 36 cancelled.

Regarding claim 47 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 44, but not wherein the user terminal is further configured to: perform voice recognition on the collected voice information to obtain the voice text, and send the voice text to the conference device. 

Fujino is cited to disclose wherein the user terminal is further configured to: perform voice recognition on the collected voice information to obtain the voice text (“When sound data in the predetermined frequency domain is detected from the acquired sound data, namely, when it is determined that the speaker's speech is detected (S11:YES), a volume level of the acquired sound data is measured (S12). That is to say, a volume level of the speaker's voice acquired from the microphone 140 is measured. Thereafter, it is determined whether the volume level is smaller than a threshold (S13). The threshold is a value that is set in a threshold storage area (not shown) of the HDD 104 in advance. As one example, the threshold is 30 dB. When the volume level is equal to or greater than 30 dB as the threshold (S13:NO), the process goes to S11. The process for detecting the speaker's speech is executed again. When the volume level is smaller than 30 dB as the threshold (S13:YES), conversion from the speaker's speech into text data is carried out through the sound recognition based on the sound data (S14),” Fujino, col. 5, lines 40-56. See Fujino, fig. 5.), and send the voice text to the conference device (“When the conversion of sound data into text data through the sound recognition is executed (S14), it is determined whether the conversion of the sound data into text data succeeded (S15). Concretely, the sound data is converted into text data, and it is determined whether the converted text data is stored in the predetermined storage area as the recognition result,” Fujino, col. 6, lines 15-21. 
And, “When the recognition result is stored in the predetermined storage area, namely, when it is determined that the conversion of the sound data into text data through the sound recognition succeeded (S15:YES), the text data as the recognition result stored in the predetermined storage area of the RAM 103 at S14 is transmitted to the other communication control apparatus 100 (S16),” Fujino, col. 6, lines 27-33. The communication control apparatus 100 is part of the conference device.). Fujino benefits Lord et al. by creating text data based on the input sound data and displaying them in chronological order, thereby allowing the contents spoken in the past to be visually referred to, and closer communication made possible (Fujino, Background). Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Fujino to provide the conference system with smooth and efficient performance. 

Claim 48 cancelled.

Claim(s) 37 and 49 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20130144603, hereinafter referred to as Lord et al., in view of US 20220246168, hereinafter referred to as Marti et al., and further in view of US 20200279551, hereinafter referred to as Sung et al.

Regarding claim 37 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 34, but not wherein an operation of performing voice recognition on the voice information and determining the voice text corresponding to the voice information, comprises: performing voice recognition on the voice information through a connected edge end device, and determining the voice text corresponding to the voice information.

Sung et al. is cited to disclose wherein an operation of performing voice recognition on the voice information and determining the voice text corresponding to the voice information, comprises: performing voice recognition on the voice information through a connected edge end device (“The processor 120 of the electronic apparatus 100 of the disclosure may convert a received user voice into a text by using a voice recognition module. The voice recognition module may include a speech-to-text module. Here, the electronic apparatus 100 may receive a user voice from an external apparatus through a communicator (130 in FIG. 3), or may directly receive input of a user voice by a microphone (150 in FIG. 3) provided. Here, the external apparatus may be at least one of an edge device and an ambient device (ambience dot),” Sung et al., para [0070].), and determining the voice text corresponding to the voice information (Sung et al., para [0070].). Sung et al. benefits Lord et al. by performing speech recognition on a connected edge device, thereby reducing computational latency. Therefore, it would be obvious to combine the teachings of Lord et al. with those of Sung et al. to improve the voice conferencing latency of Lord et al. 

Regarding claim 49 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 45, wherein the conference device is configured to perform voice recognition on the voice information through a connected edge end device to obtain the voice text (“The processor 120 of the electronic apparatus 100 of the disclosure may convert a received user voice into a text by using a voice recognition module. The voice recognition module may include a speech-to-text module. Here, the electronic apparatus 100 may receive a user voice from an external apparatus through a communicator (130 in FIG. 3), or may directly receive input of a user voice by a microphone (150 in FIG. 3) provided. Here, the external apparatus may be at least one of an edge device and an ambient device (ambience dot),” Sung et al., para [0070].). Sung et al. benefits Lord et al. by performing speech recognition on a connected edge device, thereby reducing computational latency. Therefore, it would be obvious to combine the teachings of Lord et al. with those of Sung et al. to improve the voice conferencing latency of Lord et al. 

Claim(s) 40 and 51 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20130144603, hereinafter referred to as Lord et al., in view of US 20220246168, hereinafter referred to as Marti et al., and further in view of US 20130325972, hereinafter referred to as Boston et al.

Regarding claim 40 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 39, but not wherein after generating the conference record, the method further comprises: 

identifying key information in the conference record according to a text summarization algorithm, and generating a conference summary according to the identified key information; or, 

sending the conference record to a server, so that the server identifies key information in the conference record according to a text summarization algorithm to obtain a conference summary; and receiving the conference summary sent by the server; or, 

forwarding the conference record to a server through the terminal, so that the server identifies key information in the conference record according to a text summarization algorithm to obtain a conference summary; and receiving the conference summary forwarded by the server through the terminal; wherein the method further comprises: generating a download link address corresponding to at least one of the conference record or the conference summary.

Boston et al. is cited to disclose identifying key information in the conference record according to a text summarization algorithm, and generating a conference summary according to the identified key information (“Meeting digests 222 represent a set of one or more meeting digests. A meeting digest includes a set of one or more meeting summaries that is personalized to a particular user. A meeting digest may include keywords, selected video frames, snippets from email communications, snippets from instant messaging communications, snippets from chat room discussions, extracted audio content, speech to text phrases, extracted "to do" items from an electronic "to do" list, extracted project dates from an electronic calendar, or any other information that may be of interest to a particular user,” Boston et al., para [0035]. Here, the key information is keywords. And, “For example, digest email 400 includes personalized meeting summary 402, personalized meeting summary 404, and extracted relevant metadata 406. Personalized meeting summary 402 and personalized meeting summary 404 are summaries of two different meetings, which were attended by a particular user, that are customized to that particular user utilizing the user's preferences and interests included in a user model associated with that particular user. Both personalized meeting summary 402 and personalized meeting summary 404 include textual summary 408, visual summary 410, and keyword summary 412, such as textual summary 314, visual summary 316, and keyword summary 318 in FIG. 3,” Boston et al., para [0069].); or, 

sending the conference record to a server, so that the server identifies key information in the conference record according to a text summarization algorithm to obtain a conference summary; and receiving the conference summary sent by the server (“Server 106 may be, for example, a meeting digest server device that provides digests of meeting summaries,” Boston et al., para [0024]. Also, “Users may utilize clients 110, 112, and 114 to receive and view meeting digest information provided by, for example, server 106. Server 106 may send the meeting digest information to one or more of clients 110, 112, and 114 in, for example, an email or as a calendar entry,” Boston et al., [0026]. And, “In this example, persistent storage 208 stores meeting summaries 218, user models 220, and meeting digests 222. Meeting summaries 218 represent a set of one or more meeting summaries that are generated by a meeting digest server, such as server 106 in FIG. 1, from meeting data that are captured by a conferencing server, such as server 104 in FIG. 1. The captured meeting data may include, for example, textual data, such as speech to text transcriptions, documents, and annotations; visual data, such as video clips, slide data, picture data, and graphical data; and audio data. The meeting digest server may generate meeting summaries 218 by using, for example, a text-based analysis and/or a video-based analysis of the captured meeting data,” Boston et al., para [0032]-[0033].); or, 

forwarding the conference record to a server through the terminal, so that the server identifies key information in the conference record according to a text summarization algorithm to obtain a conference summary; and receiving the conference summary forwarded by the server through the terminal (“Conferencing server device 302 may send meeting recordings 308 to meeting digest server device 304 on a pre-determined time interval basis, such as daily, weekly, or monthly. Alternatively, conferencing server device 302 may send a recording of a meeting to meeting digest server device 304 at the completion of each meeting or on demand by meeting digest server device 304. Upon receiving meeting recordings 308, meeting digest server device 304 generates meetings summarization 310,” Boston et al., para [0054]. Also, it is clear by Boston et al., fig. 3, that the client device sends conference information to the conference server device.); 

wherein the method further comprises: generating a download link address corresponding to at least one of the conference record or the conference summary (“Furthermore, all meeting artifacts included within meeting digest with a personalized set of meeting summaries 338 may be hyperlinked to corresponding visual data segments or textual data segments of the meeting,” Boston et al., para [0063]. And, “Also, personalized meeting summary 402 and personalized meeting summary 404 include highlighted hyperlinks 414, which are active links to corresponding textual and visual data within meeting recordings, such as meeting recordings 308 in FIG. 3,” Boston et al., para [0069].). Boston et al. benefits Lord et al. by providing meeting summaries for those who wish to review the meeting’s key points without having to page through an entire transcript. Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Boston et al. to improve the conference history-generating capabilities of Lord et al. 

Regarding claim 51 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 44, wherein the voice text further comprises user information (Lord et al., fig. 1C.); 

wherein the user information is determined according to a voiceprint feature corresponding to the voice information, and the voiceprint feature is obtained by performing voiceprint recognition on the voice information (“The speaker recognizer 214 identifies the speaker based on acoustic properties of the speaker's voice, as reflected by the speech data received from the conferencing device 120. The speaker recognizer 214 may compare a speaker voice print to previously generated and recorded voice prints stored in the data store 240 in order to find a best or likely match. Voice prints or other signal properties may be determined with reference to voice mail messages, voice chat data, or some other corpus of speech data,” Lord et al., para [0054].); 

wherein the conference device is further configured to: 

generate a conference record according to the voice text (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178]. The transcript is a conference record.); or, 

generate a conference record according to the voice text and a user name corresponding to the voice text (“At block 3.5204, the process performs informing the user of an identifier of a speaker along with a transcription of a previous utterance made by the speaker. As shown in FIG. 1C, a transcript may include a speaker's name displayed next to an utterance from that speaker,” Lord et al., para [0178]. Here, the speaker identifier is the user information corresponding to the voice text (i.e., speaker’s portion of transcript).). 

Lord et al., though, does not disclose wherein, the conference device is configured to: identify key information in the conference record according to a text summarization algorithm, and generate a conference summary according to the identified key information; or, 

the conference device is configured to send the conference record to a server; and the server is configured to identify key information in the conference record according to a text summarization algorithm to obtain a conference summary, and send the conference summary to the conference device; or, 

the conference device is configured to forward the conference record to a server through the terminal; and the server is configured to: identify key information in the conference record according to a text summarization algorithm to obtain a conference summary, and forward the conference summary through the terminal to the conference device; 

wherein the conference device is further configured to:

generate a download link address corresponding to at least one of the conference record or the conference summary.  

Boston et al. is cited to disclose wherein, the conference device is configured to: identify key information in the conference record according to a text summarization algorithm, and generate a conference summary according to the identified key information (“Meeting digests 222 represent a set of one or more meeting digests. A meeting digest includes a set of one or more meeting summaries that is personalized to a particular user. A meeting digest may include keywords, selected video frames, snippets from email communications, snippets from instant messaging communications, snippets from chat room discussions, extracted audio content, speech to text phrases, extracted "to do" items from an electronic "to do" list, extracted project dates from an electronic calendar, or any other information that may be of interest to a particular user,” Boston et al., para [0035]. Here, the key information is keywords. And, “For example, digest email 400 includes personalized meeting summary 402, personalized meeting summary 404, and extracted relevant metadata 406. Personalized meeting summary 402 and personalized meeting summary 404 are summaries of two different meetings, which were attended by a particular user, that are customized to that particular user utilizing the user's preferences and interests included in a user model associated with that particular user. Both personalized meeting summary 402 and personalized meeting summary 404 include textual summary 408, visual summary 410, and keyword summary 412, such as textual summary 314, visual summary 316, and keyword summary 318 in FIG. 3,” Boston et al., para [0069].); or, 

the conference device is configured to send the conference record to a server; and the server is configured to identify key information in the conference record according to a text summarization algorithm to obtain a conference summary, and send the conference summary to the conference device (“Server 106 may be, for example, a meeting digest server device that provides digests of meeting summaries,” Boston et al., para [0024]. Also, “Users may utilize clients 110, 112, and 114 to receive and view meeting digest information provided by, for example, server 106. Server 106 may send the meeting digest information to one or more of clients 110, 112, and 114 in, for example, an email or as a calendar entry,” Boston et al., [0026]. And, “In this example, persistent storage 208 stores meeting summaries 218, user models 220, and meeting digests 222. Meeting summaries 218 represent a set of one or more meeting summaries that are generated by a meeting digest server, such as server 106 in FIG. 1, from meeting data that are captured by a conferencing server, such as server 104 in FIG. 1. The captured meeting data may include, for example, textual data, such as speech to text transcriptions, documents, and annotations; visual data, such as video clips, slide data, picture data, and graphical data; and audio data. The meeting digest server may generate meeting summaries 218 by using, for example, a text-based analysis and/or a video-based analysis of the captured meeting data,” Boston et al., para [0032]-[0033].); or, 

the conference device is configured to forward the conference record to a server through the terminal; and the server is configured to: identify key information in the conference record according to a text summarization algorithm to obtain a conference summary, and forward the conference summary through the terminal to the conference device (“Conferencing server device 302 may send meeting recordings 308 to meeting digest server device 304 on a pre-determined time interval basis, such as daily, weekly, or monthly. Alternatively, conferencing server device 302 may send a recording of a meeting to meeting digest server device 304 at the completion of each meeting or on demand by meeting digest server device 304. Upon receiving meeting recordings 308, meeting digest server device 304 generates meetings summarization 310,” Boston et al., para [0054]. Also, it is clear by Boston et al., fig. 3, that the client device sends conference information to the conference server device.); 

wherein the conference device is further configured to:

generate a download link address corresponding to at least one of the conference record or the conference summary (“Furthermore, all meeting artifacts included within meeting digest with a personalized set of meeting summaries 338 may be hyperlinked to corresponding visual data segments or textual data segments of the meeting,” Boston et al., para [0063]. And, “Also, personalized meeting summary 402 and personalized meeting summary 404 include highlighted hyperlinks 414, which are active links to corresponding textual and visual data within meeting recordings, such as meeting recordings 308 in FIG. 3,” Boston et al., para [0069].). Boston et al. benefits Lord et al. by providing meeting summaries for those who wish to review the meeting’s key points without having to page through an entire transcript. Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Boston et al. to improve the conference history-generating capabilities of Lord et al. 

Claim(s) 43 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20130144603, hereinafter referred to as Lord et al., in view of US 20220246168, hereinafter referred to as Marti et al., and further in view of US 11539543, hereinafter referred to as Cho.

Regarding claim 43 (Previously Presented), Lord et al., as modified by Marti et al., discloses the method according to claim 33, wherein the displaying of the conference content related to the voice text, comprises at least one of: 

displaying the voice text in real time (Lord et al., para [0036] and [0048].); 

displaying a user name corresponding to the voice text in real time (“For example, in response to the onset of an utterance of a speaker, the AEFS 100 may display the name of the speaker on the display 121, so that the user is always informed as to who is speaking,” Lord et al., para [0033].); 

displaying a conference record related to the voice text (Lord et al., fig. 1C.); 

However, Lord et al. does not disclose wherein after displaying the conference content related to the voice text, the method further comprises: 

in response to a user's second editing instruction for at least one of the conference record or the conference summary, performing an editing operation on a content corresponding to the second editing instruction; 

wherein the editing operation comprises at least one of modification, addition, or deletion.

Cho is cited to disclose wherein after displaying the conference content related to the voice text, the method further comprises: 

in response to a user's second editing instruction for at least one of the conference record or the conference summary, performing an editing operation on a content corresponding to the second editing instruction (Cho, col. 14, lines 13-24.); 

wherein the editing operation comprises at least one of modification, addition, or deletion (“Then, the user terminal 420 may receive from the user 410 at least one second user input 614 within the user interface indicating a command to create a meeting note. For example, the user 410 may modify the title of online meeting displayed on the user interface for creating a meeting note or add new content to the text editing area to create a meeting note. Then, the user 410 may select a meeting note sharing button through a touch input or the like in order to share the meeting note he or she has written so as to transmit a meeting note sharing request 624 to the external device 430. The meeting note sharing request 624 may include information associated with the second user input 614,” Cho, col. 14, lines 13-24.). Cho benefits Lord et al. by allowing a user to make modifications to conference records after the meeting is concluded, thereby allowing the user to add clarifying comments, corrections, and other useful information for future review. Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Cho to improve the conference history-generating capabilities of Lord et al.  

Claim(s) 46 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20130144603, hereinafter referred to as Lord et al., in view of US 20220246168, hereinafter referred to as Marti et al., and further in view of US 20120290299, hereinafter referred to as Basson et al.

Regarding claim 46 (Previously Presented), Lord et al., as modified by Marti et al., discloses the conference system according to claim 44, further comprising a server (Lord et al., para [0190].); but not

wherein the user terminal is configured to send the collected voice information to the server; the server is configured to perform voice recognition on the voice information to obtain the voice text, and send the voice text to the user terminal; and the user terminal is configured to send the voice text to the conference device; or, 

the user terminal is configured to send the collected voice information to the conference device; the conference device is configured to forward the voice information to the server; and the server is configured to perform voice recognition on the voice information to obtain the voice text, and send the voice text to the conference device.  

Basson et al. is cited to disclose wherein 

the user terminal is configured to send the collected voice information to the conference device; the conference device is configured to forward the voice information to the server; and the server is configured to perform voice recognition on the voice information to obtain the voice text, and send the voice text to the conference device (“The system can be based, for example, on a voice-over-IP (VoIP) conferencing architecture. The speech transcription server 704 receives the voice flow from the PBX server 706, and transcribes the voice into text in real-time, and then sends the text flow back to the PBX server 706. The PBX server sends the text flow along with the voice and video streams to each client (participant) in the conference,” Basson et al., para [0042]. Here, the collected voice information is sent to a conference device (interpreted as private branch exchange (PBX) server 706) which then forwards the voice information to a transcription server for performing voice recognition on the voice information. The resulting voice text is then sent back to the conference device (i.e., PBX server).). Basson et al. benefits Lord et al. by providing techniques for recording and presenting voice conference history information based on speaker-related information determined from speaker utterances and/or other sources. Therefore, it would be obvious for one skilled in the art to combine the teachings of Lord et al. with those of Basson et al. to enhance the conference history-generation capabilities of Lord et al. 

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571)272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANNE L THOMAS-HOMESCU/
Primary Examiner, Art Unit 2656
Prosecution Timeline

May 16, 2024
Application Filed
Nov 26, 2025
Non-Final Rejection mailed — §103
Feb 24, 2026
Response Filed
May 07, 2026
Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/398,200
Patent 12639527
MACHINE LEARNING BASED SYSTEMS AND METHODS FOR GENERATING EMAILS
2y 5m to grant Granted May 26, 2026
18/439,966
Patent 12639511
MEMORY-BASED FUNCTION CALLING FOR LOGICAL FORMULATION AND INFERENCE FOR LARGE LANGUAGE MODELS
2y 3m to grant Granted May 26, 2026
18/391,536
Patent 12632477
RECOGNIZING POLLING QUESTIONS FROM A CONFERENCE CALL DISCUSSION
2y 5m to grant Granted May 19, 2026
18/395,189
Patent 12619821
Expediting Generative Token Production using Speculative Sampling, Added Guidance, and Language Models of Different Capacities
2y 4m to grant Granted May 05, 2026
18/102,472
Patent 12614556
ENCODING METHOD, ENCODING DEVICE, DECODING METHOD, AND DECODING DEVICE USING SCALAR QUANTIZATION AND VECTOR QUANTIZATION
3y 3m to grant Granted Apr 28, 2026
Based on 5 most recent grants.
Prosecution Projections

3-4
Expected OA Rounds
77%
Grant Probability
99%
With Interview (+35.8%)
2y 7m (~7m remaining)
Median Time to Grant
Moderate
PTA Risk
Based on 369 resolved cases by this examiner. Grant probability derived from career allowance rate.