Prosecution Insights
Last updated: April 19, 2026
Application No. 18/843,685

METHOD AND DEVICE OF STREAM MERGING FOR SPEECH CO-HOSTING

Non-Final OA (§102, §112)
Filed: Sep 03, 2024
Examiner: HACKENBERG, RACHEL J
Art Unit: 2454
Tech Center: 2400 — Computer Networks
Assignee: BEIJING ZITIAO NETWORK TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 79% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 79%, above average (236 granted / 300 resolved; +20.7% vs TC avg)
Interview Lift: strong, +26.4% across resolved cases with interview
Typical Timeline: 2y 10m avg prosecution; 35 currently pending
Career History: 335 total applications across all art units
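The headline figures above follow from the raw counts. A minimal sketch, assuming the +26.4% interview lift is applied as a relative (multiplicative) boost to the career allow rate; the dashboard does not state its exact formula, so that reading is an assumption:

```python
# Recompute the dashboard's headline examiner numbers from raw counts.
# ASSUMPTION: the interview lift is a relative boost to the allow rate;
# the page does not document the formula it actually uses.

granted, resolved = 236, 300
allow_rate = granted / resolved          # career allow rate
interview_lift = 0.264                   # reported +26.4% lift

career_pct = round(allow_rate * 100)
with_interview_pct = round(allow_rate * (1 + interview_lift) * 100)

print(career_pct)          # 79  (% career allow rate)
print(with_interview_pct)  # 99  (% grant probability with interview)
```

Under this reading, 236/300 = 78.7% rounds to the displayed 79%, and 78.7% x 1.264 = 99.4% rounds to the displayed 99%, so the numbers are internally consistent.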

Statute-Specific Performance

§101: 4.9% (-35.1% vs TC avg)
§103: 53.2% (+13.2% vs TC avg)
§102: 14.2% (-25.8% vs TC avg)
§112: 17.8% (-22.2% vs TC avg)
Tech Center averages are estimates • Based on career data from 300 resolved cases
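One consistency check on these rows: subtracting each reported delta from the examiner's statute-specific rate should recover the Tech Center average, and every row implies the same 40.0% baseline. A small sketch (the page does not define the underlying metric, e.g. allowance after a rejection under that statute, so the interpretation of the percentages is left open):

```python
# Each row reports (examiner rate, delta vs Tech Center average) in
# percentage points; recover the implied TC average per statute.
rows = {
    "101": (4.9, -35.1),
    "103": (53.2, +13.2),
    "102": (14.2, -25.8),
    "112": (17.8, -22.2),
}

implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in rows.items()}
print(implied_tc_avg)  # every statute implies the same 40.0% baseline
```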

Office Action

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement

The information disclosure statement (IDS) was submitted on 09/22/2024. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 4 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 4 recites the limitation "the device" in line 1. This renders the claim unclear, as there is insufficient antecedent basis for this limitation in the claim: Claim 4 depends on Claim 1, and the previous limitations do not recite "a device". The same rejection applies to Claim 17.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-7, 9-10, and 13-23 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 2018/0288467 A1 (Holmberg).

Regarding Claim 1: Holmberg teaches a method of stream merging, comprising:

obtaining a first speech stream comprising speech information of a first user (i.e., second performer vocals) associated with a live streaming interaction event (i.e., livestream vocal duet); ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals. [0029] The broadcast mix is presented as a vocal duet. Audio of the live stream includes both conversational-type audio portions captured in correspondence with interactive conversation between the first and second performers. [0045] Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences.)

obtaining a second speech stream (i.e., second performer vocals) and a first image, the second speech stream comprising speech information of a second user associated with the live streaming interaction event, the first image comprising image information of the second user (i.e., video of the first and second performers); ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals, the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.)

merging the first speech stream, the second speech stream, and the first image (i.e., video of first and second performers) to obtain first merged streaming data; ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals, the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.)

obtaining a second image indicating image information of the first user (i.e., video feature of the first performer and/or second performer); ([0026] The method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. The method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.)

and encoding the second image and the first merged streaming data to obtain second merged streaming data.
([0026] In some embodiments, the method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. In some embodiments, the method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.)

The first merged streaming data is the broadcast mix, and it is encoded with a video feature of either or both of the first and second performer video to provide the second merged streaming data.

Regarding Claim 9: Holmberg teaches an electronic device comprising a processor and a memory; the memory storing computer execution instructions; the processor executing the computer execution instructions stored in the memory to cause the processor to perform the acts ([0004] Computationally, these computing (mobile) devices offer speed and storage capabilities comparable to engineering workstation or workgroup computers from less than ten years ago, and typically include powerful media processors, rendering them suitable for real-time sound synthesis and other musical applications. [0041] FIG. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to facilitate processing and communication of a captured audiovisual performance for use in a multi-vocalist livestreaming configuration of network-connected devices in accordance with some embodiments of the present invention(s).) comprising:

obtaining a first speech stream comprising speech information of a first user (i.e., second performer vocals) associated with a live streaming interaction event (i.e., livestream vocal duet); ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals. [0029] The broadcast mix is presented as a vocal duet. Audio of the live stream includes both conversational-type audio portions captured in correspondence with interactive conversation between the first and second performers. [0045] Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences.)

obtaining a second speech stream (i.e., second performer vocals) and a first image, the second speech stream comprising speech information of a second user associated with the live streaming interaction event, the first image comprising image information of the second user (i.e., video of first and second performers); ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals, the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.)

merging the first speech stream, the second speech stream, and the first image (i.e., video of first and second performers) to obtain first merged streaming data; ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals, the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.)

obtaining a second image indicating image information of the first user (i.e., video feature of the first performer and/or second performer); ([0026] The method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. The method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.)

and encoding the second image and the first merged streaming data to obtain second merged streaming data. ([0026] In some embodiments, the method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. In some embodiments, the method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.)

The first merged streaming data is the broadcast mix, and it is encoded with a video feature of either or both of the first and second performer video to provide the second merged streaming data.

Regarding Claim 10: Holmberg teaches a non-transitory computer-readable storage medium storing program code for computer execution, the program code comprising instructions for performing the acts ([0086] Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software.) comprising:

obtaining a first speech stream comprising speech information of a first user (i.e., second performer vocals) associated with a live streaming interaction event (i.e., livestream vocal duet); ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals. [0029] The broadcast mix is presented as a vocal duet. Audio of the live stream includes both conversational-type audio portions captured in correspondence with interactive conversation between the first and second performers. [0045] Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences.)

obtaining a second speech stream (i.e., second performer vocals) and a first image, the second speech stream comprising speech information of a second user associated with the live streaming interaction event, the first image comprising image information of the second user (i.e., video of first and second performers); ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals, the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.)

merging the first speech stream, the second speech stream, and the first image (i.e., video of first and second performers) to obtain first merged streaming data; ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals, the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.)
obtaining a second image indicating image information of the first user (i.e., video feature of the first performer and/or second performer); ([0026] The method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. The method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.)

and encoding the second image and the first merged streaming data to obtain second merged streaming data. ([0026] In some embodiments, the method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. In some embodiments, the method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.)

The first merged streaming data is the broadcast mix, and it is encoded with a video feature of either or both of the first and second performer video to provide the second merged streaming data.

Regarding Claims 2, 15, 22: Holmberg teaches the inventions of claims 1, 9, and 10 as described. Holmberg teaches wherein the obtaining a first speech stream comprises: obtaining the first speech stream from a forward server, the forward server (i.e., content server) configured to forward speech information of the first user. ([0052] FIG. 1, iPhone™ handhelds available from Apple Inc. (or more generally, handhelds 101A, 101B operating as guest and host devices, respectively) execute software that operates in coordination with a content server 110 to provide vocal capture. [0053] A current guest user of current guest device 101A contributes to the group audiovisual performance mix 111 that is supplied (eventually via content server 110) by current host device 101B as live stream 122. [0058] User vocals 103A and 103B are captured at respective handhelds 101A, 101B, and may be optionally pitch-corrected continuously and in real-time and audibly rendered mixed with the locally-appropriate backing track (e.g., backing track 107A at current guest device 101A and guest mix 106 at current host device 101B).) The content server obtains/receives and forwards/provides audio and image data between user devices.

Regarding Claims 3, 16, 23: Holmberg teaches the inventions of claims 1, 9, and 10 as described. Holmberg teaches wherein a device for stream merging is comprised in a live streamer terminal. ([0041] FIG. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to facilitate processing and communication of a captured audiovisual performance for use in a multi-vocalist livestreaming configuration of network-connected devices.) The components shown in Figs. 4 and 5 capture livestreams (from both the camera and the mic) and then send them to an encoder. The encoder is the device within the mobile device (live streamer terminal).

Regarding Claims 4, 17: Holmberg teaches the inventions of claims 1 and 9 as described. Holmberg teaches wherein the device for stream merging is comprised in a merge server. ([0054] Content that is mixed to form group audiovisual performance mix 111 is captured, in the illustrated configuration, in the context of karaoke-style performance capture wherein lyrics 102, optional pitch cues 105 and, typically, a backing track 107 are supplied from content server 110 to either or both of current guest device 101A and current host device 101B. Claim 47: receiving at the second device, a media encoding of a mixed audio performance … mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween; and supplying the broadcast mix to a service platform configured to livestream the broadcast mix to plural recipient devices constituting an audience.) The content server (the second device as claimed in Claim 47) is a merge server and comprises components for stream merging.

Regarding Claims 5, 18: Holmberg teaches the inventions of claims 4 and 17 as described. Holmberg teaches wherein obtaining a second speech stream and a first image comprises: obtaining the second speech stream and the first image from the forward server, the forward server (i.e., content server) further configured to forward image information and speech information of the second user. ([0052] FIG. 1, iPhone™ handhelds available from Apple Inc. (or more generally, handhelds 101A, 101B operating as guest and host devices, respectively) execute software that operates in coordination with a content server 110 to provide vocal capture. [0053] A current guest user of current guest device 101A contributes to the group audiovisual performance mix 111 that is supplied (eventually via content server 110) by current host device 101B as live stream 122. [0058] User vocals 103A and 103B are captured at respective handhelds 101A, 101B, and may be optionally pitch-corrected continuously and in real-time and audibly rendered mixed with the locally-appropriate backing track (e.g., backing track 107A at current guest device 101A and guest mix 106 at current host device 101B).) The content server obtains/receives and forwards/provides audio and image data between user devices.

Regarding Claims 6, 19: Holmberg teaches the inventions of claims 1 and 9 as described. Holmberg teaches wherein the second image comprises a target image and a visual effect associated with the first speech stream, the target image configured to indicate the first user. ([0026] In some embodiments, the method further includes dynamically varying in the broadcast mix at least visual prominence of one or the other of the first and second performers based on evaluation of a computationally audio defined feature of either or both of the first and second performer vocals. In some embodiments, the method further includes applying one or more video effects to the broadcast mix based, at least in part, on a computationally defined audio or video feature of either or both of the first and second performer audio or video.) The first merged streaming data is the broadcast mix, and it is encoded with a video feature of either or both of the first and second performer video to provide the second merged streaming data.

Regarding Claims 7, 20: Holmberg teaches the inventions of claims 1 and 9 as described. Holmberg teaches wherein after obtaining the second merged streaming data, the method further comprises: sending the second merged streaming data to a streaming media server (i.e., service platform).
(Claim 47: receiving at the second device, a media encoding of a mixed audio performance … mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween; and supplying the broadcast mix to a service platform configured to livestream the broadcast mix to plural recipient devices constituting an audience.)

Regarding Claims 13, 14, 21: Holmberg teaches the inventions of claims 1, 9, and 10 as described. Holmberg teaches wherein: both the first user and the second user are live streamer users; or the first user is an audience user and the second user is a live streamer user. ([0025] The received media encoding includes video that is performance synchronized with the captured first performer vocals. [0029] The broadcast mix is presented as a vocal duet. Audio of the live stream includes both conversational-type audio portions captured in correspondence with interactive conversation between the first and second performers. [0045] Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences.)

Conclusion & Contact Information

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RACHEL J HACKENBERG, whose telephone number is (571) 272-5417. The examiner can normally be reached 9am-5pm M-F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Glenton B Burgess, can be reached at (571) 272-3949. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RACHEL J HACKENBERG/
Primary Examiner, Art Unit 2454
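The claim elements that the §102 rejection maps onto Holmberg form a simple two-stage pipeline: merge both speech streams with the second user's image, then encode the first user's image into the result. A minimal sketch of that structure, with all function and field names hypothetical (the claim recites steps, not an implementation, and none of these identifiers appear in the application or in Holmberg):

```python
# Hypothetical sketch of claim 1's two-stage merge; names are
# illustrative only, not taken from the application or prior art.

def merge_streams(first_speech, second_speech, first_image):
    # Merge both speech streams with the second user's image to
    # obtain the "first merged streaming data".
    return {"audio": [first_speech, second_speech], "video": [first_image]}

def encode_with_second_image(second_image, first_merged):
    # Encode the first user's image together with the first merged
    # streaming data, yielding the "second merged streaming data".
    return {"audio": first_merged["audio"],
            "video": first_merged["video"] + [second_image]}

# Walking the claimed steps end to end (host/guest labels are arbitrary):
first_merged = merge_streams("host vocals", "guest vocals", "guest video")
second_merged = encode_with_second_image("host video", first_merged)
print(second_merged["video"])  # ['guest video', 'host video']
```

Laid out this way, the §112(b) issue in claims 4 and 17 is also easy to see: "the device" has no antecedent, because claim 1 recites only the steps above, never "a device" performing them.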

Prosecution Timeline

Sep 03, 2024
Application Filed
Jan 24, 2026
Non-Final Rejection — §102, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12587464
FAULT INJECTION CONFIGURATION EQUIVALENCY TESTING
2y 5m to grant • Granted Mar 24, 2026
Patent 12580819
DETERMINING SERVICE GROUP CAPACITY BASED ON AN AGGREGATE RISK METRIC
2y 5m to grant • Granted Mar 17, 2026
Patent 12500823
SYSTEM AND METHOD FOR ENTERPRISE-WIDE DATA UTILIZATION TRACKING AND RISK REPORTING
2y 5m to grant • Granted Dec 16, 2025
Patent 12495001
CAPACITY AWARE LOAD PACKING FOR LAYER-4 LOAD BALANCER
2y 5m to grant • Granted Dec 09, 2025
Patent 12470508
RESTRICTING MESSAGE NOTIFICATIONS AND CONVERSATIONS BASED ON DEVICE TYPE, MESSAGE CATEGORY, AND TIME PERIOD
2y 5m to grant • Granted Nov 11, 2025
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
79%
Grant Probability
99%
With Interview (+26.4%)
2y 10m
Median Time to Grant
Low
PTA Risk
Based on 300 resolved cases by this examiner. Grant probability derived from career allow rate.
