Last updated: May 29, 2026
Application No. 18/254,568
VOICE PROCESSING METHOD, APPARATUS AND SYSTEM, SMART TERMINAL AND ELECTRONIC DEVICE

Non-Final OA §103
Filed
May 25, 2023
Priority
Dec 29, 2020 — CN 202011598381.X +1 more
Examiner
MASTERS, KRISTEN MICHELLE
Art Unit
2659
Tech Center
2600 — Communications
Assignee
BEIJING BYTEDANCE NETWORK TECHNOLOGY CO., LTD.
OA Round
2 (Non-Final)
This examiner grants 63% of cases after interview (+22.3% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 46 resolved cases, 2023–2026
Examiner Intelligence

MASTERS, KRISTEN MICHELLE
Grants 63% of resolved cases
Career Allowance Rate
29 granted / 46 resolved
+1.0% vs TC avg
Strong +22% interview lift
Typical timeline
3y 0m
Avg Prosecution
24 currently pending
Statute-Specific Performance

§101: 12.4% (-27.6% vs TC avg)
§103: 85.0% (+45.0% vs TC avg)
Based on career data from 46 resolved cases
Office Action

§103
Detailed Action
This communication is in response to the Application filed on 9/17/2025. 
Claims 1-6, 8-18, 21-23 are pending and have been examined.
Claims 1-6, 8-18, 21-23 are rejected.
Claims 7 and 19-20 have been cancelled.
Claim 23 has been added.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The Applicants have amended the independent claims to include “wherein the method is applied to an application scenario comprising a smart terminal, a first terminal device, and a cloud server, the first terminal device is a device for a first participant user to conduct a remote conference with a second participant user, the smart terminal is a device for the second participant user to conduct the remote conference with the first participant user, and the method … collecting, by the smart terminal, … for a speech delivered by the second participant user during a process of the conference; generating, by the smart terminal, both … playing a voice of the speech delivered by the second participant user, … of the speech delivered by the second participant user, wherein a transcribed text corresponding to the recognition flow is obtained by performing the voice recognition on the recognition flow; and sending, by the smart terminal, … wherein sending the call flow and the recognition flow comprises: sending, by the smart terminal, the recognition flow to the cloud server, the recognition flow being sent to the first terminal device through the cloud server, so that the first terminal device performs the voice recognition to determine the transcribed text based on the recognition flow, and performs text display based on the transcribed text; and sending, by the smart terminal, the call flow to the cloud server, the call flow being sent to the first terminal device through the cloud server, so that the first terminal device plays the voice based on the call flow.”
Regarding the 35 USC § 101 rejection, the Applicants’ arguments and amendments overcome the 35 USC § 101 rejection.
Regarding the 35 U.S.C. § 103 rejections, Applicant’s arguments with respect to claim(s) 1, 8 and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Hence, new grounds for rejection have been made in view of Contreras (US Patent Application Publication No. US 20140161243 A1), in view of Trim (US Patent Application Publication No. US 20220021551 A1).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 8, 9, 14-18, 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Contreras (US Patent Application Publication No. US 20140161243 A1), in view of Trim (US Patent Application Publication No. US 20220021551 A1).

Regarding claim 1, Contreras teaches A voice processing method, wherein the method is applied to an application scenario comprising a smart terminal, a first terminal device, and a cloud server, (see Contreras [0022-0023] “In another example embodiment, the cloud or remote data center 186 may run hosted applications for systems 110, 120, and 130. This may occur by establishing a virtual machine application executing software to manage applications hosted at the remote data center 186. Mobile information handling systems 110, 120, and 130 are adapted to run one or more applications locally, and to have hosted applications run in association with the local applications at remote data center 186. The virtual machine application may serve one or more applications to each of user mobile information handling systems 110, 120, and 130, including a conference call management system and the interface with such a system….[0023] The wireless adapters in systems 110, 120, and 130 can represent add-in cards, wireless network interface modules that are integrated with a main board of respective systems 110, 120, and 130 or integrated with another wireless network interface capability, or any combination thereof. In an embodiment the wireless adapters may include one or more radio frequency subsystems including transmitters and wireless controllers for connecting via a multitude of wireless links. In an example embodiment, a mobile information handling system may have a transmitter for WiFi or WiGig connectivity and one or more transmitters for macro-cellular communication. 
The radio frequency subsystems include wireless controllers to manage authentication, connectivity, communications, power levels for transmission, buffering, error correction, baseband processing, and other functions of the wireless adapters.”) the first terminal device is a device for a first participant user to conduct a remote conference with a second participant user, the smart terminal is a device for the second participant user to conduct the remote conference with the first participant user, and the method comprises (see Contreras [0039-0040] “If at decision diamond 315, however, conference call bound identification data is found in the conference call database matching a dialed-from number, … the flow proceeds to block 355 where a list of all the scheduled conference calls is compiled…. The list of conference calls is sent to the information handling system at the dialed-from number. … [0040] … At block 375, the method and conference call management system auto-enters a participant passcode for the scheduled conference call. The flow then proceeds to block 340 where access is granted to join the scheduled conference call. If the dialed-from number does match a meeting organizer at decision diamond 370, the flow proceeds to block 380. At block 380, the method and conference call management system auto-enters a leader passcode and, if necessary, also a participant passcode for the scheduled conference call. Then the flow proceeds to block 340 where access is granted to join the scheduled conference call. Upon access being granted, a connection to the scheduled conference call is made.”) (see Contreras [0043] “The flow begins at 401 of FIG. 4A with a teleconference access attempt. At 405, the conference call management system receives a dialed-from phone number from a user information handling system such as 110, 120, or 130….”)
Contreras does not specifically teach collecting, by the smart terminal, audio information for a speech delivered by the second participant user during a process of the conference; However, Trim does teach this limitation (see Trim [0042] “In embodiments, user profile 122 can be configured to include all or some of the rules or action strategies associated with system profile module 112, as well as specific party/user preferences regarding the communication feed during a conference meeting or call. In other words, a party/user can identify particular audio (e.g., sounds, voices, etc.), video (objects, people, etc.), or static images that should be modified from the communication feed (i.e., audio feed and video feed) during a conference meeting, and those identified preferences can be added to user profile 122…”) generating, by the smart terminal, both a call flow and a recognition flow according to the audio information, wherein the call flow is used for playing a voice of the speech delivered by the second participant user, and the recognition flow is used for voice recognition of the speech delivered by the second participant user, (see Trim “[0045] “In embodiments, audio module 116 can be configured to analyze an audio feed from the communication feed of conferencing system 101 to identify people (i.e., via voice audio, such as spoken utterances), and various sounds from the environment audio (i.e., background noise). In some embodiments, audio module 116 detects a party's/user's contextual activity and/or the contextual activity between a first party and a second party using analysis techniques such as, topic modeling, neural networks, IBM WATSON® and/or machine learning modeling. In other embodiments, audio module 116 provides an audio feed to contextual analysis module 110 to analyze the contextual activity of a party and/or the contextual activity between a first party and a second party of a conference call. 
Audio module 116 can include any number of audio devices (such as, audio devices, and/or Internet of Things (IOT) sensor feeds) necessary to provide conferencing call functions described herein, such as speech recognition, conversation detections, and audio modification. Using voice recognition, audio module 116 can be configured to determine voice parameters (e.g., power bandwidth) that can be used to distinguish between the voice of each person of a party/user and identify a person. Once identified, those audio parameters can be added to the appropriate user profile 122 where they can be used to identify or acknowledge whether a person of a party can contribute to the audio feed of a conference meeting.”) wherein a transcribed text corresponding to the recognition flow is obtained by performing the voice recognition on the recognition flow; (see Trim [0047] “In embodiments, display module 118 can be configured to provide or display any information or data associated with the parties and/or users of conferencing system 101 during a conferencing call. In some embodiments, display module 118 can be configured to analyze the audio of one party. In these embodiments, display module 118 and associated display devices can utilize any number of known speech to text techniques to convert the audio feed to text, subtitles, or closed caption. 
For example, a conference meeting audio feed associated with a first participating party can be converted to text and displayed to a second participating party as a closed caption.”) and sending, by the smart terminal, the call flow and the recognition flow; wherein sending the call flow and the recognition flow comprises: sending, by the smart terminal, the recognition flow to the cloud server, the recognition flow being sent to the first terminal device through the cloud server, so that the first terminal device performs the voice recognition to determine the transcribed text based on the recognition flow, and performs text display based on the transcribed text; (see Trim [0048] “FIG. 1B illustrates a block diagram of a natural language processing system 120, configured to contextually analyze contextual activity associated with audio and video feeds during a conferencing meeting, in accordance with embodiments of the present invention. In some embodiments, conferencing system 101 may submit a communication feed (e.g., audio feed and video feed) from video module 114, audio module 116, and/or display module 118, containing contextual activity (i.e., conversations, sounds, video, and images) associated with at least one party of a conference call to be analyzed by natural language processing system 120. Natural language processing system 120 can use the contextual activity to identify particular contextual situations and determine possible actions strategies. In some embodiments, natural language processing system 120 can include a text-to-speech analyzer, allowing for contextual activity (e.g., conversations of first party) to be transcribed. In these embodiments the transcribed contextual activity can then be analyzed by natural language processing system 120. 
In embodiments of conferencing system 101 using display module 118 to display text or closed caption of an audio feed, natural language processing system 120 can be further configured to receive electronic documentation of the display and proceed with analyzing the text.”) and sending, by the smart terminal, the call flow to the cloud server, the call flow being sent to the first terminal device through the cloud server, so that the first terminal device plays the voice based on the call flow. (see Trim [0076] “The embodiment depicted in FIG. 4 can include a first party 402 in a first environment 404 having both audio and visual aspects. First party 402 can configure and control how first environment 404 and its members are viewed by communication feed 406 via at least one user profile 122 that can be retrieved from a cloud database. In some embodiments, IoT sensor feeds, neural network enabled cameras, and other devices can be used to view first environment 404 and observe or survey contextual activity. Contextual activity can include, but is not limited to, video, sound, voice audio (i.e., spoken utterances), images (e.g., objects or people) observed in first environment 404. In some embodiments, the contextual activity captured by conferencing system 101 can be collected outside the parameters of a conference call on a rolling basis to improve conferencing system 101's learning capabilities. In other embodiments, the contextual activity detected by conferencing system 101 is only collected during a conference call.”)
Contreras and Trim are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Contreras to incorporate the teachings of Trim to include collecting, by the smart terminal, audio information for a speech delivered by the second participant user during a process of the conference; generating, by the smart terminal, both a call flow and a recognition flow according to the audio information, wherein the call flow is used for playing a voice of the speech delivered by the second participant user, and the recognition flow is used for voice recognition of the speech delivered by the second participant user, wherein a transcribed text corresponding to the recognition flow is obtained by performing the voice recognition on the recognition flow; and sending, by the smart terminal, the call flow and the recognition flow; wherein sending the call flow and the recognition flow comprises: sending, by the smart terminal, the recognition flow to the cloud server, the recognition flow being sent to the first terminal device through the cloud server, so that the first terminal device performs the voice recognition to determine the transcribed text based on the recognition flow, and performs text display based on the transcribed text; and sending, by the smart terminal, the call flow to the cloud server, the call flow being sent to the first terminal device through the cloud server, so that the first terminal device plays the voice based on the call flow. Doing so improves the responsiveness of the conferencing system to the participating parties, as recognized by Trim in [0066].
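For orientation, the data flow recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation and not something disclosed in Contreras or Trim. The class names (SmartTerminal, CloudServer, FirstTerminalDevice, Flow) are hypothetical, audio is modeled as a list of integer samples, and the recognize function is a placeholder standing in for a real speech recognizer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Flow:
    kind: str            # "call" or "recognition"
    samples: List[int]   # audio payload (hypothetical representation)


@dataclass
class FirstTerminalDevice:
    played: List[List[int]] = field(default_factory=list)
    displayed: List[str] = field(default_factory=list)

    def receive(self, flow: Flow) -> None:
        if flow.kind == "call":
            # Call flow: play the voice.
            self.played.append(flow.samples)
        elif flow.kind == "recognition":
            # Recognition flow: recognize on this device, then display the text.
            self.displayed.append(self.recognize(flow.samples))

    @staticmethod
    def recognize(samples: List[int]) -> str:
        # Placeholder for a real speech recognizer (assumption).
        return f"<transcript of {len(samples)} samples>"


@dataclass
class CloudServer:
    first_terminal: FirstTerminalDevice

    def relay(self, flow: Flow) -> None:
        # The server only forwards; recognition happens on the receiving device.
        self.first_terminal.receive(flow)


@dataclass
class SmartTerminal:
    server: CloudServer

    def on_speech(self, audio: List[int]) -> None:
        # Generate both flows from the same collected audio, then send each
        # one to the cloud server.
        call_flow = Flow("call", list(audio))
        recognition_flow = Flow("recognition", list(audio))
        self.server.relay(recognition_flow)
        self.server.relay(call_flow)


first = FirstTerminalDevice()
SmartTerminal(CloudServer(first)).on_speech([3, -1, 4, 1])
print(first.played)     # [[3, -1, 4, 1]]
print(first.displayed)  # ['<transcript of 4 samples>']
```

In this sketch the cloud server is a pure relay and the transcription step runs on the first terminal device, which mirrors the claim language that the first terminal device "performs the voice recognition".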

As to Claim 2, Contreras in view of Trim teaches 2. The method according to claim 1, (see Claim 1). 
Furthermore, Trim teaches, wherein generating both the call flow and the recognition flow respectively according to the audio information comprises: processing the audio information according to different processing methods to obtain the call flow and the recognition flow. (see Trim [0036] “In embodiments, conferencing system 101 can utilize one, some, or all of the modules and/or their described herein sub-components depicted in FIG. 1A to analyze contextual activity and modify a communication feed during a conference meeting. A communication feed can include an audio feed and a video feed obtained from conferencing system 101 during a conferencing call or meeting. Using video module 114 and audio module 116, conferencing system 101 can modify the communication feed acquired from the first party in a variety of different ways before displaying the modified communication feed to the other parties participating in the conferencing meeting. These possible modifications include but are not limited to: i) switching between communication modes, ii) adding and subtracting audio components from the audio feed, and iii) adding and subtracting video components (both moving and still images) from the video feed.”) (see Trim [0045] “In embodiments, audio module 116 can be configured to analyze an audio feed from the communication feed of conferencing system 101 to identify people (i.e., via voice audio, such as spoken utterances), and various sounds from the environment audio (i.e., background noise). In some embodiments, audio module 116 detects a party's/user's contextual activity and/or the contextual activity between a first party and a second party using analysis techniques such as, topic modeling, neural networks, IBM WATSON® and/or machine learning modeling. 
In other embodiments, audio module 116 provides an audio feed to contextual analysis module 110 to analyze the contextual activity of a party and/or the contextual activity between a first party and a second party of a conference call. Audio module 116 can include any number of audio devices (such as, audio devices, and/or Internet of Things (IOT) sensor feeds) necessary to provide conferencing call functions described herein, such as speech recognition, conversation detections, and audio modification. Using voice recognition, audio module 116 can be configured to determine voice parameters (e.g., power bandwidth) that can be used to distinguish between the voice of each person of a party/user and identify a person. Once identified, those audio parameters can be added to the appropriate user profile 122 where they can be used to identify or acknowledge whether a person of a party can contribute to the audio feed of a conference meeting.”)
Contreras and Trim are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Contreras and Trim to incorporate the teachings of Trim to include generating both the call flow and the recognition flow respectively according to the audio information comprises: processing the audio information according to different processing methods to obtain the call flow and the recognition flow. Doing so improves the responsiveness of the conferencing system to the participating parties, as recognized by Trim in [0066].
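The claim 2 limitation, deriving the two flows from the same audio by different processing methods, can be illustrated with a short sketch. The specific processing steps here are assumptions chosen for illustration only: a simple noise gate stands in for whatever processing produces the call flow, and an unmodified copy stands in for the recognition flow. Neither the function names nor the threshold come from the application or the cited references.

```python
from typing import List, Tuple


def noise_gate(samples: List[int], threshold: int) -> List[int]:
    # Hypothetical "call" processing: zero out low-amplitude samples.
    return [s if abs(s) >= threshold else 0 for s in samples]


def passthrough(samples: List[int]) -> List[int]:
    # Hypothetical "recognition" processing: keep the signal unmodified.
    return list(samples)


def generate_flows(audio: List[int], threshold: int = 2) -> Tuple[List[int], List[int]]:
    # Same input audio, two different processing paths, two output flows.
    call_flow = noise_gate(audio, threshold)
    recognition_flow = passthrough(audio)
    return call_flow, recognition_flow


call_flow, recognition_flow = generate_flows([5, 1, -3, 0, -1, 7])
print(call_flow)         # [5, 0, -3, 0, 0, 7]
print(recognition_flow)  # [5, 1, -3, 0, -1, 7]
```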

Regarding Independent Claim 8, claim 8 is a device claim with limitations similar to that of claim 1 and is rejected under the same rationale. Furthermore, Trim teaches 8. A smart terminal, comprising: a microphone array, a processor and a communication module; (see Trim Figure 2 element 206 and 208, see Contreras [0146] “FIG. 10, illustrated is a high-level block diagram of an example computer system 1001 that may be used in implementing one or more of the methods, tools, and modules, and any related functions, described herein (e.g., using one or more processor circuits or computer processors of the computer), in accordance with embodiments of the present invention. In some embodiments, the major components of the computer system 1001 may comprise one or more Processor 1002, a memory subsystem 1004, a terminal interface 1012, a storage interface 1016, an I/O (Input/Output) device interface 1014, and a network interface 1018, all of which may be communicatively coupled, directly or indirectly, for inter-component communication via a memory bus 1003, an I/O bus 1008, and an I/O bus interface unit 1010.”) (see Contreras [0032] “In embodiments, user device 104 includes user interface 108. User interface 108 provides an interface between each user device 104 and conferencing system 101. User interface 108 can be a graphical user interface (GUI), a web user interface (WUI) or any other suitable interface for a user to interact with and execute the methods and/or techniques described herein.”)
Contreras and Trim are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device of Contreras to incorporate the teachings of Trim to include a smart terminal comprising a microphone array, a processor and a communication module. Doing so improves the responsiveness of the conferencing system to the participating parties, as recognized by Trim in [0066].

As to Claim 9, claim 9 is a device claim with limitations similar to that of claim 2 and is rejected under the same rationale.

Regarding claim 14, Contreras in view of Trim teaches 14. The smart terminal according to claim 8, 
Furthermore, Contreras teaches, further comprising: a speaker, configured to play a voice based on a call flow sent by a first terminal device. (see Contreras Fig. 1, element 130. As is well known in the art, a cellular telephone has a microphone as its voice input and routes the audio output signal to the cellular telephone earpiece.)

Regarding Independent Claim 15, claim 15 is a device claim with limitations similar to that of claim 1 and is rejected under the same rationale. Furthermore, Contreras teaches 15. (Currently amended) A voice processing apparatus, comprising: at least one processor and a memory; wherein, the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory to enable the at least one processor to: (see Contreras [0057] FIG. 5 shows a method 500 capable of administering each of the specific embodiments of the present disclosure. The information handling system 500 can represent the user information handling systems 110, 120 and 130 or servers or systems located anywhere within network 100 of FIG. 1, including the teleconference bridge system 195 or the remote data center or cloud 180 operating the virtual machine applications described herein. The information handling system 500 may include a processor 502 such as a central processing unit (CPU), a graphics processing unit (GPU), or both. Moreover, the information handling system 500 can include a main memory 504 and a static memory 507 that can communicate with each other via a bus 508. The information handling system 500 includes signal generation device 518 such as for a speaker or a remote control. The information handling system 500 can also include a disk drive unit 516, and a network interface device 520. As shown, the information handling system 500 may further include a video display unit 510, such as a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, or a cathode ray tube (CRT). The video display unit 510 may also act as an input accepting touchscreen inputs. Additionally, the information handling system 500 may include an input device 512, such as a keyboard, or a cursor control device, such as a mouse or touch pad. 
Information handling system may include a battery system 514. The information handling system 500 can represent a device capable of telecommunications and whose can be share resources, voice communications, and data communications among multiple devices. The information handling system 500 can also represent a server device whose resources can be shared by multiple client devices, or it can represent an individual client device, such as a laptop or tablet personal computer.”)

Regarding Claim 16, Contreras in view of Trim teaches 16. A voice processing system, comprising: a first terminal device and the smart terminal according to claim 8. (see claim 8).

Regarding claim 17, Contreras in view of Trim teaches the voice processing method according to claim 1.
Furthermore, Contreras teaches, 17. An electronic device, comprising: at least one processor and a memory; wherein, the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory to cause the at least one processor executes (see Contreras [0057] FIG. 5 shows a method 500 capable of administering each of the specific embodiments of the present disclosure. The information handling system 500 can represent the user information handling systems 110, 120 and 130 or servers or systems located anywhere within network 100 of FIG. 1, including the teleconference bridge system 195 or the remote data center or cloud 180 operating the virtual machine applications described herein. The information handling system 500 may include a processor 502 such as a central processing unit (CPU), a graphics processing unit (GPU), or both. Moreover, the information handling system 500 can include a main memory 504 and a static memory 507 that can communicate with each other via a bus 508. The information handling system 500 includes signal generation device 518 such as for a speaker or a remote control. The information handling system 500 can also include a disk drive unit 516, and a network interface device 520. As shown, the information handling system 500 may further include a video display unit 510, such as a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, or a cathode ray tube (CRT). The video display unit 510 may also act as an input accepting touchscreen inputs. Additionally, the information handling system 500 may include an input device 512, such as a keyboard, or a cursor control device, such as a mouse or touch pad. Information handling system may include a battery system 514. 
The information handling system 500 can represent a device capable of telecommunications and whose can be share resources, voice communications, and data communications among multiple devices. The information handling system 500 can also represent a server device whose resources can be shared by multiple client devices, or it can represent an individual client device, such as a laptop or tablet personal computer.”)

Regarding claim 18, Contreras in view of Trim teaches the voice processing method according to claim 1.
Furthermore, Contreras teaches, 18. A non-transitory computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement (see Contreras [0034] “Conferencing system 101 can be a standalone computing system, a server, and/or a virtualized system running on one or more servers within a cloud networking environment capable of analyzing contextual activity and modifying the communication feed during a conference meeting for participating parties/users connected to network 102. Conferencing system 101 can include contextual analysis module 110, system profile module 112, video module 114, audio module 116, and display module 118. The term “module” may refer to a hardware module, software module, or a module may be a combination of hardware and software resources. Embodiments of hardware-based modules may include self-contained components such as chipsets, specialized circuitry, one or more memory devices and/or persistent storage (see FIG. 10). A software-based module may be part of a program (e.g., programs 1028, FIG. 10), program code or linked to program code containing specifically programmed instructions loaded into a memory device or persistent storage device of one a data processing systems operating as part of the networking environment 100. For example, data associated with contextual analysis module 110, system profile module 112, video module 114, audio module 116, and/or display module 118, depicted in FIG. 1, can be loaded into memory or database, such as database 106.”)

Regarding claim 21, Contreras in view of Trim teaches the voice processing apparatus according to claim 15.
Furthermore, Contreras teaches 21. A voice processing system, comprising: a first terminal device and; (see Contreras Figure 1A elements 116 and 120)

As to Claim 22, claim 22 is a device claim with limitations similar to that of claim 2 and is rejected under the same rationale.

Regarding claim 23, Contreras in view of Trim teaches 23. The method according to claim 1, 
Furthermore, Contreras teaches, wherein the application scenario further comprises a second terminal device for the second participant user to conduct the remote conference with the first participant user, and a third terminal device not in the remote conference; and wherein the cloud server further sends the recognition flow to the third terminal device, so that the third terminal device performs the voice recognition to determine the transcribed text based on the recognition flow, and performs text display based on the transcribed text. (see Contreras Figure 1A and see Contreras [0058] “In some embodiments, the output of natural language processor 124 may be used by search application 128 to perform a search of a set of (e.g., one or more) corpora to retrieve one or more subdivisions including a particular requirement associated with the contextual activity and send the output (i.e., contextual situation) to a word processing system and to a comparator. As used herein, a corpus may refer to one or more data sources, such as data source 126. In some embodiments, data source 126 may include video libraries, data warehouses, information corpora, data models, and document repositories, and a historical repository of communication feed associated with conferencing system 101. In some embodiments, data source 126 may include an information corpus 146. Information corpus 146 may enable data storage and retrieval. In some embodiments, information corpus 146 may be a subject repository that houses a standardized, consistent, clean, and integrated list of words, images, and dialogue. For example, information corpus 146 may include verbal statements made by a storage provider representative (e.g., a phone message where a representative states that 1 terabyte of cloud storage can be provided by their storage provider). The data may be sourced from various operational systems. 
Data stored in information corpus 146 may be structured in a way to specifically address reporting and analytic requirements. In some embodiments, information corpus 146 may be a relational database or a text index.”)

Claims 3, 6, 10 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Contreras (US Patent Number US 20140161243 A1), in view of Trim (US Patent Number US 20220021551 A1), and further in view of Gargaro (US Patent Number US 20200274911 A1).

As to Claim 3, Contreras in view of Trim teaches 3. The method according to claim 2, 
Furthermore, Trim teaches wherein processing the audio information according to the different processing methods to obtain the call flow and the recognition flow comprises: performing clarity enhancement processing on the audio information to obtain the call flow; (see Trim [0025] “Embodiments of the present invention provide a more robust way for parties to constructively participate in conference calls. Embodiments can include, but are not limited to: analyzing contextual activity to determine contextual situations (e.g., a mode switch indicator) that, once observed during the conference call can trigger particular action strategies (e.g., communication mode switching); modifying audio transmitted during a conference call by adding and/or subtracting audio components (e.g., voices and sounds) based, at least in part, on a user profile; and modifying video transmitted during a conference call by adding and/or subtracting visual components (e.g., people and objects).”)
Contreras in view of Trim does not specifically teach: and performing fidelity processing on the audio information to obtain the recognition flow. However, Gargaro does teach this limitation (see Gargaro [0029] “Moving to FIG. 1B, each client 105 in the mute mode transmits the test signals to the server 110. The server 110 verifies the quality of the conference call for the participant of the client 105 according to a comparison of the received (audio) signals that are received by the client 105 with the corresponding test signals (which are expected). For example, when the test signals in a specific range of frequencies are cut in the received signals, it is possible to ascertain that the fidelity of a corresponding audio channel (between the client 105 and the server 110) is low; moreover, when the test signals are completely missing in the received signals, it is possible to ascertain that the audio channel is broken. At the same time, the server 110 prevents the test signals to be transmitted to the other clients 105, for example, by suppressing them from the received signals that are broadcast thereto (so as to avoid adding a corresponding noise to the voice of the participant that is currently speaking in the conference call).”)
Contreras in view of Trim and further in view of Gargaro are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Contreras and Trim to incorporate the teachings of Gargaro to include performing fidelity processing on the audio information to obtain the recognition flow. Doing so allows the quality of the audio to be monitored to ensure that each speaking participant is heard, as recognized by Gargaro in [0003-0005].

Regarding claim 6, Contreras in view of Trim and further in view of Gargaro teaches 6. The method according to claim 3,
Contreras in view of Trim and further in view of Gargaro do not specifically teach: before performing the clarity enhancement processing on the audio information to obtain the call flow. However, Weisman does teach this limitation (see Weisman [0116] “As taught in the prior art, incoming voice signals range in amplitude, noise content, and activity and require some conditioning before voice signals are combined in a conference bridge. Signal processing blocks 211-214 provide such conditioning. Each of signal processing blocks 211-214 processes input signals Talker 1-Talker N respectively to provide the functions of noise gate (to eliminate low noise levels during intervals of silence), automatic gain control (to raise the voice level of a quiet speaker or a participant station having a low gain voice input 170).”) (see Weisman [0115] “FIG. 2 shows the preferred signal processing logic comprising voice bridge 70. Voice signals from receive buffer 60 (not shown in FIG. 2) arrive as distinctly identified signals Talker 1 through Talker N. These identified signals each correspond to a different one of the plurality of participant stations 10, 12. Here, N would be the number of participant stations 10, 12 actively connected in a conference call.”) the method further comprises: performing echo cancellation processing on the audio information. (see Weisman [0116] “As taught in the prior art, incoming voice signals range in amplitude, noise content, and activity and require some conditioning before voice signals are combined in a conference bridge. Signal processing blocks 211-214 provide such conditioning. Each of signal processing blocks 211-214 processes input signals Talker 1-Talker N respectively to provide the functions of noise gate (to eliminate low noise levels during intervals of silence), automatic gain control (to raise the voice level of a quiet speaker or a participant station having a low gain voice input 170).”)
Contreras in view of Trim and further in view of Gargaro and Weisman are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Contreras, Trim and Gargaro to incorporate the teachings of Weisman to include, before performing the clarity enhancement processing on the audio information to obtain the call flow, performing echo cancellation processing on the audio information. Doing so allows the voice signals to be properly conditioned for conversations, as recognized by Weisman in [0115].
Furthermore, Gargaro teaches: and performing the fidelity processing on the audio information to obtain the recognition flow, (see Gargaro [0029] “Moving to FIG. 1B, each client 105 in the mute mode transmits the test signals to the server 110. The server 110 verifies the quality of the conference call for the participant of the client 105 according to a comparison of the received (audio) signals that are received by the client 105 with the corresponding test signals (which are expected). For example, when the test signals in a specific range of frequencies are cut in the received signals, it is possible to ascertain that the fidelity of a corresponding audio channel (between the client 105 and the server 110) is low; moreover, when the test signals are completely missing in the received signals, it is possible to ascertain that the audio channel is broken. At the same time, the server 110 prevents the test signals to be transmitted to the other clients 105, for example, by suppressing them from the received signals that are broadcast thereto (so as to avoid adding a corresponding noise to the voice of the participant that is currently speaking in the conference call).”)
Contreras in view of Trim and further in view of Weisman and Gargaro are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Contreras, Trim, Gargaro and Weisman to incorporate the teachings of Gargaro to include performing fidelity processing on the audio information to obtain the recognition flow. Doing so allows the quality of the audio to be monitored to ensure that each speaking participant is heard, as recognized by Gargaro in [0003-0005].


As to Claim 10, claim 10 is a device claim with limitations similar to that of claim 3 and is rejected under the same rationale.

As to Claim 13, claim 13 is a device claim with limitations similar to that of claim 6 and is rejected under the same rationale.

Claims 4, 5, 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Contreras (US Patent Number US 20140161243 A1), in view of Trim (US Patent Number US 20220021551 A1), and further in view of Gargaro (US Patent Number US 20200274911 A1), and further in view of Weisman (US Patent Number US 20040047461 A1).

As to Claim 4, Contreras in view of Trim and further in view of Gargaro teaches 4. The method according to claim 3,
Contreras in view of Trim and further in view of Gargaro do not specifically teach wherein performing the clarity enhancement processing on the audio information to obtain the call flow comprises: performing noise reduction processing and automatic gain control on the audio information to obtain the call flow. However, Weisman does teach this limitation (see Weisman [0116] “As taught in the prior art, incoming voice signals range in amplitude, noise content, and activity and require some conditioning before voice signals are combined in a conference bridge. Signal processing blocks 211-214 provide such conditioning. Each of signal processing blocks 211-214 processes input signals Talker 1-Talker N respectively to provide the functions of noise gate (to eliminate low noise levels during intervals of silence), automatic gain control (to raise the voice level of a quiet speaker or a participant station having a low gain voice input 170).”) (see Weisman [0115] “FIG. 2 shows the preferred signal processing logic comprising voice bridge 70. Voice signals from receive buffer 60 (not shown in FIG. 2) arrive as distinctly identified signals Talker 1 through Talker N. These identified signals each correspond to a different one of the plurality of participant stations 10, 12. Here, N would be the number of participant stations 10, 12 actively connected in a conference call.”)
Contreras in view of Trim and further in view of Gargaro and Weisman are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Contreras, Trim and Gargaro to incorporate the teachings of Weisman to include, as the clarity enhancement processing on the audio information to obtain the call flow, performing noise reduction processing and automatic gain control on the audio information to obtain the call flow. Doing so allows the voice signals to be properly conditioned for conversations, as recognized by Weisman in [0115].

As to Claim 5, Contreras in view of Trim and further in view of Gargaro teaches 5. The method according to claim 3, 
Contreras in view of Trim and further in view of Gargaro do not specifically teach wherein performing the fidelity processing on the audio information to obtain the recognition flow comprises: performing beam selection processing on the audio information to obtain the recognition flow. However, Weisman does teach this limitation (see Weisman [0174] “The Location field contains the current location of the participant, as well as can be determined. The Location field is preferably dynamic and is updated frequently. If the participant station in use is stationary, the street address will suffice for determining the location, by mechanisms well known. If the participant station is a cellular telephone, then either a GPS-based location or cellular antenna beam or other datum is used to derive the Location. The Location field may be considered as an element of the private part of participant data.”)
Contreras in view of Trim and further in view of Gargaro and Weisman are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of Contreras, Trim and Gargaro to incorporate the teachings of Weisman to include, as the fidelity processing on the audio information to obtain the recognition flow, performing beam selection processing on the audio information to obtain the recognition flow. Doing so allows the voice signals to be properly conditioned for conversations, as recognized by Weisman in [0115].

As to Claim 11, claim 11 is a device claim with limitations similar to that of claim 4 and is rejected under the same rationale. 

As to Claim 12, claim 12 is a device claim with limitations similar to that of claim 5 and is rejected under the same rationale.


Conclusion
THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659 

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659
Prosecution Timeline

May 25, 2023: Application Filed
Jul 09, 2025: Non-Final Rejection mailed (§103)
Sep 17, 2025: Response Filed
Jan 14, 2026: Final Rejection mailed (§103)
Mar 06, 2026: Response after Non-Final Action
Apr 01, 2026: Request for Continued Examination
Apr 07, 2026: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

17/513,614 (Patent 12592219): Hearing Device User Communicating With a Wireless Communication Device. 4y 5m to grant; granted Mar 31, 2026.
17/415,675 (Patent 12548569): METHOD AND SYSTEM OF DETECTING AND IMPROVING REAL-TIME MISPRONUNCIATION OF WORDS. 3y 2m to grant; granted Feb 10, 2026.
17/790,795 (Patent 12548564): SYSTEM AND METHOD FOR CONTROLLING A PLURALITY OF DEVICES. 3y 7m to grant; granted Feb 10, 2026.
17/940,549 (Patent 12547894): ENTROPY-BASED ANTI-MODELING FOR MACHINE LEARNING APPLICATIONS. 3y 5m to grant; granted Feb 10, 2026.
18/311,150 (Patent 12547840): MULTI-STAGE PROCESSING FOR LARGE LANGUAGE MODEL TO ANSWER MATH QUESTIONS MORE ACCURATELY. 2y 9m to grant; granted Feb 10, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 63%
With Interview (+22.3%): 85%
Median Time to Grant: 3y 0m (~0m remaining)
PTA Risk: Moderate
Based on 46 resolved cases by this examiner. Grant probability derived from career allowance rate.