DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/06/2026 has been entered.
Response to Amendment
The amendment to the claims filed on 12/01/2025 complies with the requirements of 37 CFR 1.121(c) and has been entered.
Response to Arguments
Applicant's Arguments/Remarks filed 12/01/2025 (hereinafter Resp.) have been fully considered as follows:
Applicant’s main argument is that Yang (US 2020/0035215 A1) in view of Dai ("Semantic Coded Transmission ... ", arXiv:2112.03093v2) does not disclose the amended “selectively schedule, based on a network quality parameter and a scheduling configuration, a transmission” feature of the independent claims. Although the argument is moot in view of the new grounds of rejection over Roessel et al., U.S. Patent Application Publication No. 2023/0094234 (hereinafter Roessel), it should be noted that Dai discloses enough information to allow a person of ordinary skill in the art to implement a joint source-channel autoencoder wherein either the video signal, or the semantically encoded one or more data elements and the metadata, are transmitted over the classical transmission channel: “a compatible scheme with current digital communications systems (e.g., 4G and 5G), the modular implementation of SCT retains the channel coding and digital modulation modules to transmit digital signals” – See §V(A.). The third paragraph of §V(A.) further shows that in cases of low SNR, where conventional schemes suffer from many errors at the bit level, the SCT technique outperforms classic transmission. Fig. 4 of Dai compares the perceived quality at a receiver decoding classic and semantically encoded transmissions affected by the same level of noise, thus teaching a person of ordinary skill in the art a level of channel quality at which to switch between the two techniques. Fig. 5 of Dai further shows that SCT is better for low-resolution images but not for “high-resolution images due to the highly optimized source codes and ideal channel codes” – See §V(B.)(1.), thus teaching a person of ordinary skill in the art that classic encoding of high-resolution images performs better than semantic coding on high-bandwidth channels. Because obtaining an indication of a network quality parameter is a technique known in the art, a person of ordinary skill in the art applying the results disclosed in Dai would know how to configure the semantic scheduler to switch between classic transmission and SCT based on that indication. Therefore, the argument is unpersuasive and, in any event, moot.
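Note: solely to illustrate the reasoning above, the switching behavior that Dai's Figs. 4 and 5 would suggest to a person of ordinary skill in the art can be sketched in Python-style pseudocode. Every name and threshold below (select_transmission, SNR_SWITCH_DB, HIGH_RES_PIXELS) is the examiner's hypothetical illustration and is not drawn from Yang, Dai, or Roessel:

    # Hypothetical sketch of a semantic scheduler that switches between
    # classic transmission and semantic coded transmission (SCT) based on an
    # indication of channel quality. Threshold values are illustrative
    # placeholders, not values disclosed in any cited reference.

    SNR_SWITCH_DB = 5.0            # assumed SNR below which SCT outperforms classic coding (cf. Dai, Fig. 4)
    HIGH_RES_PIXELS = 1920 * 1080  # assumed size above which classic codecs win (cf. Dai, Fig. 5)

    def select_transmission(snr_db: float, frame_pixels: int) -> str:
        """Return which pipeline to schedule on the classical channel."""
        if frame_pixels >= HIGH_RES_PIXELS:
            # Highly optimized source codes and ideal channel codes favor
            # classic transmission for high-resolution images.
            return "classic"
        if snr_db < SNR_SWITCH_DB:
            # At low SNR, bit-level errors degrade classic schemes faster
            # than they degrade semantically coded transmission.
            return "sct"
        return "classic"

    assert select_transmission(snr_db=2.0, frame_pixels=640 * 480) == "sct"
    assert select_transmission(snr_db=12.0, frame_pixels=640 * 480) == "classic"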
Claim Objections
Claim 2 is objected to because of the following informalities: "The device of claim 2" should read "The device of claim 1". Appropriate correction is required.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-3, 5, 7, 26-28, and 30-35, as amended, are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Roessel et al., U.S. Patent Application Publication No. 2023/0094234 (hereinafter Roessel).
Regarding Amended Claim 1, Roessel teaches a device comprising: a processor (“FIG. 1 shows a diagrammatic representation of hardware resources 100 including one or more processors (or processor cores) 110, one or more memory/storage devices 120, and one or more communication resources 130, each of which may be communicatively coupled via a bus 140” – See [¶0022]) configured to:
extract semantic information from received data (“Semantic source coding includes extracting semantics from the source information (source "material", 202)” – See [¶0032]), wherein the semantic information comprises information representing a first attribute and a second attribute that are detected from the received data (“subjects (persons, animals, objects, ... ) as well as action(s) detected and extracted from (a set of) classical video frame(s)” – See [¶0038], i.e., “subject/object” is a first attribute, and “action” is a second attribute), and
wherein the first attribute and the second attribute are distinct from syntactic or structural information with respect to the received data (“[s]emantic segmentation 308 includes dividing, by a data processing system, an image frame (for example) into portions” – See [¶0050] whereas “[o]bject recognition includes extracting objects” that “are recognizable things within the media” – See [¶0052] and “[a]ctivity recognition includes determining, by the data processing system, what the objects in the frame are doing” – See [¶0053] i.e., the two attributes in a video frame are different from the structural information of the frame);
generate one or more data elements based on the extracted semantic information for an instance of time (“the semantic transcript stream (STS) includes . . ., an actors and actions stream” and “a details stream (e.g., for each actor or action)” whereby “[e]ach actor or action anchor is associated with data such as 6 DoF data, time stamp data, and so forth” – See [¶0047], i.e., each stream represents the one or more data elements, e.g., the details stream comprises data elements based on the extracted objects/actors and actions for an instance of time which is a frame because “determining a quality level for a channel for a time period in which a video frame is being transmitted over the channel” – See [¶0088]),
wherein the one or more data elements comprise a first data element representing the first attribute and a second data element representing the second attribute (“a details stream (e.g., for each actor or action)” comprising “details which can be associated with the particular actor or action in a nested format” – See id.);
generate metadata associated with the generated one or more data elements (“[t]he story stream includes annotations, anchors, and metadata (e.g., privacy and authentication information) for the story,” e.g., “[e]ach actor or action metadata can include privacy information or protected asset access data, which can restrict access to a given asset for video reconstruction” – See id.);
selectively schedule based on a network quality parameter and a scheduling configuration (“[t]he sender can send various amounts of data depending on the QoS” – See [¶0059] and “[i]f there is a mismatch between the predicted and actual channel, the device resolves the mismatch by dropping STS elements ( e.g., if channel quality is unexpectedly poor) or adds video frames (e.g., if the channel quality is unexpectedly robust)” – See [¶0066], whereby “[s]teering the quality and depth of semantic extraction to generate the STS is based on the predictions of channel quality . . . includes but is not limited to, e.g., (1) dynamic inclusion or skipping of 1-level or 2-level streams (e.g. the skipping the 2-level actors' and action details stream), (2) steering the size of the key actor set, (3) controlling the amount of annotations for photo- and phono-realistic enhancements, and so forth” – See [¶0061] and Fig. 4 showing the scheduler 416, i.e., selectively schedule what is transmitted based on network quality, whereby Channel Aware Semantic Coding, “CASC includes assigning (e.g., aggregating) STS elements to semantic source coding and channel coding (SSCC)” – See [¶0063] wherein “STS elements are assigned to SSCC streams based on the actual channel status” in “an n-to-m QoS directives-to-SSCC streams mapping (where n>=m),” i.e., the scheduling configuration – See [¶0064]) a transmission of either:
the received data or the one or more data elements and the metadata (“the sender sends no video frames or a limited number of video frames. The receiver is configured to rebuild the video from the frames received or from the semantic data alone” – See [¶0043], i.e., “[t]he sender can send video of varying fidelity depending on the channel quality” so that “[i]f the channel quality is very poor, the sender sends only STS data,” and “[a]s the channel quality changes, the channel capacity varies such that the amount of data that can be sent can vary” – See [¶0044])
such that if the network quality parameter indicates sufficient quality in the communication channel for the transmission of the received data, the received data is scheduled for transmission (“Based on the determined QoS, a transmission level 504 is selected,” e.g., “[i]n a high quality mode of levels 504, the sender sends full compressed video” – See [¶0078]); and
encode a scheduling information indicating the scheduling configuration for the transmission (“[b]ased on the determined QoS, a transmission level 504 is selected” whereby “[t]he levels 504 can include a compressed video and STS mode” or “an STS-enriched transmission mode including more features than the basic STS elements of the lowest mode” and “[i]n a high quality mode of levels 504, the sender sends full compressed video” – See [¶0078] and Fig. 5, showing the levels, e.g., a “semantic coding level” may be “a 0-level story full-frame” or “a 1-level and 2-level actors' and action full-frames” – See [¶0045], whereby “STS element, such as stream type, stream level, full frame, delta frame, metaframe, annotations, and anchors, are generated and transmitted depending on the QoS” and “QoS directives correspond to multiple, e.g. 8, 16, or more, levels of robustness demand, or forward error correction (FEC) demand, or modulation and coding scheme (MCS) demand” – See [¶0062], i.e., the QoS directives determine the level of the semantic source coding and channel coding, SCCS, thus indicating the scheduling configuration for the transmission).
Therefore, Amended Claim 1 is anticipated by Roessel.
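Note: solely to illustrate the mapping above, the QoS-driven selection quoted from Roessel's [¶0044] and [¶0078] can be sketched as follows. The 0.0-1.0 QoS scale, the thresholds, and the function name schedule_payload are the examiner's hypothetical paraphrase, not Roessel's implementation:

    # Hypothetical sketch of Roessel's transmission-level selection: depending
    # on the measured QoS, schedule either the received data (full compressed
    # video), a mixed mode, or the semantic data elements and metadata (STS)
    # alone. The numeric QoS scale is an illustrative assumption.

    def schedule_payload(qos: float, video_frames, sts_elements, metadata):
        """Select what to place on the channel for the current time period."""
        if qos >= 0.8:
            # Sufficient channel quality: schedule the received data itself.
            return {"video": video_frames}
        if qos >= 0.4:
            # Intermediate quality: "compressed video and STS mode".
            return {"video": video_frames, "sts": sts_elements, "meta": metadata}
        # Very poor channel: send only the semantic transcript stream.
        return {"sts": sts_elements, "meta": metadata}

    print(schedule_payload(0.9, ["frame0"], ["actor:boy"], {"privacy": "on"}))
    print(schedule_payload(0.2, ["frame0"], ["actor:boy"], {"privacy": "on"}))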
Regarding Claim 2, dependent from Amended Claim 1, Roessel further teaches the device of claim [[2]]1, wherein the processor is configured to selectively apply a first encoding configuration and a second encoding configuration to the one or more data elements based on the extracted semantic information, wherein each encoding configuration comprises a predefined modulation coding scheme or a predefined transmit power configuration (“all levels' full frames a robust MCS 1 . . . and for all levels' delta frames MCS 3” – See [¶0045], whereby “[a] full frame captures features, such as subjects (persons, animals, objects, ... ) as well as action(s) detected and extracted from (a set of) classical video frame(s)” and “interrelated key actors in story is included. For example, a subject's body pose, an action for the subject, and so forth” – See [¶0038], whereas a “delta frame for STS indicates changes since a prior frame. This can indicate updates for the subjects of the frame. For example, a body pose update for an actor in the frame, and action update for the actor, and so forth” – See [¶0039]; conversely, “[f]or a 4K resolution at 30 fps source video having a raw bit stream . . . we use a higher MCS scheme” – See [¶0046]).
Therefore, Claim 2 is anticipated by Roessel.
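Note: solely to illustrate the mapping above, the per-frame-type encoding configurations quoted from Roessel's [¶0045]-[¶0046] can be rendered as a simple lookup. The dictionary layout and the raw-video MCS index are the examiner's hypothetical illustration (only MCS 1 and MCS 3 are quoted from Roessel):

    # Hypothetical sketch of selectively applying an encoding configuration
    # per data element: robust MCS 1 for full frames, MCS 3 for delta frames,
    # and a higher MCS for a raw high-rate video stream.
    ENCODING_CONFIG = {
        "full_frame": {"mcs": 1},     # "a robust MCS 1" for all levels' full frames
        "delta_frame": {"mcs": 3},    # "MCS 3" for all levels' delta frames
        "raw_4k_video": {"mcs": 11},  # "a higher MCS scheme"; index 11 is illustrative
    }

    def encoding_for(element_type: str) -> dict:
        return ENCODING_CONFIG[element_type]

    assert encoding_for("full_frame")["mcs"] < encoding_for("delta_frame")["mcs"]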
Regarding Claim 3, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the processor is configured to schedule the transmission of the metadata and the one or more data elements independently from each other based on a network quality parameter (“CASC includes selecting an amount of data to send (e.g., STS stream and video, video only, STS stream only, amount and frequency of optional sub-streams or actors captured, etc.) based on the quality of the channel that is experienced by the sender or receiver” – See [¶0048], whereby “the semantic transcript stream (STS) includes a story stream” that “includes annotations, anchors, and metadata (e.g., privacy and authentication information) for the story” – See [¶0047], e.g., “meta-frames that identify identities of subjects, producers, locations, and so forth” that “are subject to privacy controls” – See [¶0042], and when “the sender sends no video frames . . . [t]he receiver is configured to rebuild the video . . . from the semantic data alone” using “a library of objects from which to build the video” – See [¶0043], i.e., the metadata is sent independently of the one or more data elements, e.g., “a 1-level and 2-level actors' and action full-frames” – See [¶0045] and Fig. 5, showing 4 independent levels of depth of STS to transmit, starting with the semantic transcript, i.e., the metadata of the story; furthermore, “STS elements, such as stream type, stream level, full frame, delta frame, metaframe, annotations, and anchors, are generated and transmitted depending on the QoS” – See [¶0062], whereby “STS elements are automatically dropped based on respective priority values associated with each of the STS element” – See [¶0064], e.g., “dynamic inclusion or skipping of 1-level or 2-level streams (e.g. the skipping the 2-level actors' and action details stream)” – See [¶0061], i.e., the transmission of the metadata comprising the story line and the transmission of the one or more data elements comprising details of the actors and actions and additional video frames are done independently of each other based on a network quality parameter).
Therefore, Claim 3 is anticipated by Roessel.
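Note: solely to illustrate the mapping above, the priority-based dropping quoted from Roessel's [¶0061] and [¶0064] can be sketched as follows. The element names, sizes, priorities, and the function fit_to_budget are the examiner's hypothetical illustration:

    # Hypothetical sketch of scheduling metadata and data elements
    # independently based on channel quality: STS elements are dropped
    # lowest-priority-first until the payload fits the current channel budget.

    def fit_to_budget(elements, budget_bytes):
        """elements: list of (name, size_bytes, priority); higher priority is kept first."""
        kept, used = [], 0
        for name, size, _priority in sorted(elements, key=lambda e: -e[2]):
            if used + size <= budget_bytes:
                kept.append(name)
                used += size
        return kept

    sts = [("story_metadata", 120, 3),   # 0-level story stream
           ("actors_actions", 400, 2),   # 1-level stream
           ("actor_details", 952, 1)]    # 2-level details stream, dropped first
    print(fit_to_budget(sts, budget_bytes=600))  # poor channel: details skipped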
Regarding Claim 5, dependent from Claim 3, Roessel further teaches the device of claim 3, wherein the processor is configured to prioritize the transmission of the one or more data elements relative to the transmission of the metadata based on the network quality parameter (“[t]he quality of service (QoS) metric is measured by one or both of the sender and receiver. The sender can send various amounts of data depending on the QoS” – See [¶0059], whereby “the device determines, based on the predicted channel quality, what depth of STS to transmit and what quality of video to transmit” and, if “there is a mismatch between the predicted and actual channel, the device resolves the mismatch by . . . add[ing] video frames (e.g., if the channel quality is unexpectedly robust)” – See [¶0066], i.e., based on the network quality parameter, the one or more data elements containing the media segments of the two attributes are prioritized over sending just the semantic transcript/metadata).
Therefore, Claim 5 is anticipated by Roessel.
Regarding Claim 7, dependent from Claim 1, Roessel further teaches the device of claim 1, wherein the processor is configured to determine a first protection level for the one or more data elements (“for privacy protection, the receiving device side receives end-to-end encrypted data independently of whether these are semantic transcripts, semantically boosted compressed frames, or regularly compressed frames” – See [¶0070], i.e., a first level of protection);
wherein the processor is configured to determine a second protection level for the metadata (“The semantic fusion 316 includes determining consumer-specific rules 382 for transmitting data, annotating data, generating synthetic or enhanced data, and so forth,” e.g., “[r]ules 382c-d indicate that there is no authorization for semantic details to be retrieved for the objects boy and girl” – See [¶0056], i.e., a rule-based second level of protection applies to object metadata).
Therefore, Claim 7 is anticipated by Roessel.
Regarding Claim 26, dependent from Amended Claim 1, Roessel teaches the device of claim 1, wherein the one or more data elements and the metadata are scheduled for a transmission over a physical channel (“the device resolves the mismatch [in channel quality] by dropping STS elements (e.g., if channel quality is unexpectedly poor) or adds video frames (e.g., if the channel quality is unexpectedly robust). At step 414, the device generates the SSCC stream for transmitting to the receiving device. At step 418, the device transmits the STS stream over the channel based on the generated SSCC stream” – See [¶0066] and Figs. 4 and 5, i.e., “multiple quality level semantically enhanced video and audio compressed frames” are sent on the physical channel in addition to “the semantic transcript,” i.e., metadata – See [¶0067]), and further discloses use of “[t]he joint semantic source channel (JSSC) coding [which] refers to joint encoding and optimization of semantic source coding 216 and semantic channel coding 218,” whereby a “Channel Aware Semantic Coding (CASC) enhances JSSC coding to create a combined coding phase for both the source 202 and channel 203” – See [¶0035] and Fig. 2, and further distinguishes, among the “channel coding 222 tasks,” the classical “joint source channel coding” and the “joint semantic source and channel coding” – See id.
Roessel further teaches the encoded scheduling information (“semantic transcript stream (STS) represents what is occurring in a respective media stream,” i.e., “a stream of frames, where the frames' content is captured in a computer-readable notation” e.g., “one or more of an annotated graph, mathematical categories and operators, a formal computer language, or a formalized natural language represented as text” wherein “CASC is illustrated with a formalized natural language represented as text” – See [¶0036] indicating “adaptive extraction of semantics, generation and QoS annotation of the STS and creation of semantic source and channel coding SSCC streams” – See [¶0037] and the example in Fig. 3D).
Roessel further teaches a semantic channel (“[a] semantic channel refers to semantic noise measured against semantic metrics 204. The semantic channel includes measuring correctness and consistency metrics” – See [¶0032] which are “metrics related to the correct interpretation and reconstruction of media content” – See [¶0032]).
Roessel further teaches wherein the encoded scheduling information is scheduled for a transmission over a semantic channel (in “the reference model 200, formal informational or media source content to be sent across a wireless link is subject to semantic source coding by eliminating semantic redundancy and to semantic channel coding by adding redundancy to reduce semantic noise” – See [¶0028], i.e., a CASC-generated STS stream containing the encoded scheduling information of the semantic elements is jointly semantic source-channel encoded for the “semantic channel” between the transmitter and the receiver).
Therefore, Claim 26 is anticipated by Roessel.
Regarding Claim 27, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the metadata further comprises at least one of syntactic information of the received data or structural information of the received data (“the semantic transcript stream (STS) . . . includes annotations, anchors, and metadata (e.g., privacy and authentication information)” – See [¶0047] and further includes “semantic segmentation” – See [¶0048], whereby “[s]emantic segmentation 308 includes dividing, by a data processing system, an image frame (for example) into portions” e.g., “selection of foreground and background” – See [¶0050] and Fig. 3B, i.e., the metadata includes structural information of the received data: frames and segments of a frame).
Therefore, Claim 27 is anticipated by Roessel.
Regarding Claim 28, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the extracted semantic information comprises intermediate information which requires further processing to obtain the information representing the first attribute and the second attribute (e.g., an object as a first attribute is associated with a pose, and “[o]bject poses include identifying a position and orientation of an object that is recognized within the image frame. Pose recognition may include translation and/or rotation of objects that are extracted. If the object is extracted, additional data is used to generate a representation of other poses of the object within the frame” – See [¶0052]; and for an action, as a second attribute, “[a]nnotations, other objects, the scene, and other information can provide context to actions being performed by objects within an image frame. Annotations are provided by the data processing system to describe the objects, their poses, and their activities” – See [¶0053]).
Therefore, Claim 28 is anticipated by Roessel.
Regarding Claim 30, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the received data comprises speech data, wherein the first attribute or the second attribute comprises one or more words (“The data processing system performs audio analysis 312. The audio processing includes speech and sound recognition, and generating annotations to describe these sounds,” whereby “speech and sound recognition includes identifying a source of the sound, such as a particular person, a type of animal, a type of vehicle, and so forth. Machine learning models can be used to extract and annotate sounds” – See [¶0054] and Fig. 3, i.e., annotated sounds/words may be attributed to an object, i.e., a first attribute, or an action, i.e., a second attribute)
and wherein the metadata comprises information indicating at least one of a detected emotion, a words per minute metric, or a pause between the one or more words (“the synthetic facial, lips, eye, hand, and body expressions stream representing the recreated producer side are optimized based on the sync markers anchors in the semantic transcript,” whereby “[o]ptimization means that eventually, emotions, visual and audio expressions or scenery from the producer side is dropped at the consumer side to avoid ambiguity. Markers Annotations for heard audio in the semantic transcript are used to potentially drop producer's facial, lips, eye, hand, and body expressions related to heard audio, such as when conversational end-to-end latency exceeds 100 milliseconds” – See [¶0073], i.e., emotion- and speech-related metrics are included in the metadata because “audio markers annotations and/or audio embeddings [are] added to the semantic transcript” in order to synthesize “the producer-side emotional elements in the voice from the audio markers annotations and/or audio embeddings added to the semantic transcript” – See [¶0076]).
Therefore, Claim 30 is anticipated by Roessel.
Regarding Claim 31, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the received data comprises one or more images (“The source 202 material includes media such as video or images” – See [¶0028] and Fig. 2)
wherein the one or more data elements comprises information indicating at least one of detected objects, background, or kinematic attributes of the detected objects (“the semantic transcript stream (STS) includes . . ., an actors and actions stream” and “a details stream (e.g., for each actor or action)” whereby “[e]ach actor or action anchor is associated with data such as 6 DoF data, time stamp data, and so forth” – See [¶0047] and further includes “object pose and activity recognition, . . . relationship inference and composition” – See [¶0048]),
and wherein the metadata comprises information indicating at least one of weather condition, lighting, or color of the detected objects (“receiver is configured to perform decompression of the semantic transcript, prepare models and individual assets for synthesis, . . . , perform spatial-temporal, physically consistent renders, perform texture synthesis, and render lighting or atmosphere” – See [¶0048] and Fig. 3D, showing “lightning” [sic, lighting] as an attribute of the scene, and “the process includes a routine that extracts the semantic transcript for the producer's facial, lips, eye, hand, and body expressions . . . by extracting semantic items such as persons, objects, sites/scenery, events, actions, interactions and relations between semantic items” – See [¶0067], i.e., the received STS script/metadata includes texture and lighting information).
Therefore, Claim 31 is anticipated by Roessel.
Regarding Claim 32, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the metadata comprises context information of the received data (“the spatial-temporal and physical consistency extends to the level of contextual/annotations information in the semantic transcript stream” – See [¶0072] whereby “[a]nnotations, other objects, the scene, and other information can provide context to actions being performed by objects within an image frame” – See [¶0053]).
Therefore, Claim 32 is anticipated by Roessel.
Regarding Claim 33, dependent from Amended Claim 1, Roessel further teaches the device of claim 1, wherein the one or more data elements and the metadata comprise information of a plurality of instances of time (“0-quality level story semantic transcript. STS comprises of a sequence of full and delta frames,” i.e., a plurality of instances of time are comprised in the metadata, and “[e]ach of the actors' and action stream (with a configurable frame rate [fps]) are included, such as for 1-quality level story semantic transcript” – See [¶0038], whereby “a 1-level and 2-level actors' and action full-frames” may include a “15 fps for 952 byte” video scene – See [¶0045]).
Therefore, Claim 33 is anticipated by Roessel.
Regarding Amended Claim 34, Roessel teaches a non-transitory computer-readable medium comprising instructions (“example embodiments, able to read instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and perform any one or more of the methodologies discussed” – See [¶0022] and Fig. 1) which, if executed by a processor, cause the processor to perform the same steps as recited for the device in Amended Claim 1, with no other limitations.
Therefore, Amended Claim 34 is anticipated by Roessel.
Regarding Claim 35, dependent from Amended Claim 34, Roessel further teaches the non-transitory computer-readable medium of claim 34, wherein the one or more data elements and the metadata are scheduled for a transmission over a physical channel (“A communication link includes a physical channel that can introduce channel noise measured against link metrics 208 to the transmit signal” – See [¶0032], whereby the “link metrics 208 include BLER or BER” – See [¶0034] and Fig. 2, and “[t]he device predicts a future achievable UL data rate (or the related UL grant profile in case of 5G NR)” – See [¶0060], so that the CASC’s “[s]teering the quality and depth of semantic extraction to generate the STS is based on the predictions of channel quality,” whereby “[t]he elements of STS are linked to QoS based on the QoS prediction” – See [¶0061], e.g., “QoS directives correspond to multiple, e.g. 8, 16, or more, levels of robustness demand, or forward error correction (FEC) demand, or modulation and coding scheme (MCS) demand” – See [¶0062], and “the device determines, based on the predicted channel quality, what depth of STS to transmit and what quality of video to transmit” – See [¶0066], i.e., the one or more data elements and the metadata are scheduled for a transmission over a physical channel based on predicted channel quality).
Therefore, Claim 35 is anticipated by Roessel.
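Note: solely to illustrate the mapping above, Roessel's predict-then-resolve behavior ([¶0060], [¶0066]) can be sketched as follows. The quality scale, depth encoding, and function names are the examiner's hypothetical illustration:

    # Hypothetical sketch: pick an STS depth from the predicted channel
    # quality, then resolve any mismatch with the actual channel by dropping
    # STS elements or adding video frames.

    def plan_depth(predicted_quality: float) -> int:
        # Deeper STS (more stream levels) is planned for better channels.
        return min(2, int(predicted_quality * 3))  # depths 0, 1, 2

    def resolve_mismatch(planned_depth: int, actual_q: float, predicted_q: float):
        payload = {"sts_depth": planned_depth, "extra_video_frames": 0}
        if actual_q < predicted_q:
            payload["sts_depth"] = max(0, planned_depth - 1)  # drop STS elements
        elif actual_q > predicted_q:
            payload["extra_video_frames"] = 1  # channel unexpectedly robust
        return payload

    print(resolve_mismatch(plan_depth(0.7), actual_q=0.4, predicted_q=0.7))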
In sum, Claims 1-3, 5, 7, 26-28, and 30-35, as amended, are rejected under 35 U.S.C. §102(a)(2) as anticipated by Roessel.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Roessel as applied to claim 3 above, and further in view of Park et al., U.S. Patent Application Publication No. 2016/0241890 (hereinafter Park).
Regarding Claim 4, dependent from Claim 3, Roessel further teaches the device of claim 3, wherein the processor is configured to schedule the transmission of the metadata in intervals (e.g., the device is “determining a quality level for a channel for a time period in which a video frame is being transmitted over the channel; determining, based on the quality level, one or more semantic elements to include in a semantic transcript stream (STS),” – See [¶0098], i.e., metadata is transmitted at video frame intervals).
Although Roessel teaches, e.g., in Fig. 5, transmission of metadata only, i.e., the STS semantic transcript, when the channel quality is low – See [¶0106] (“transmitting the STS data only without video frame data”) and the one or more data elements when the channel quality is higher – See [¶0107] (“transmitting the encoded video frame comprises transmitting the STS data with a full compressed video frame or by attaching one or more portions of the video frame”), thus intimating that the one or more data elements are transmitted more frequently than the transmission of the metadata when the channel quality is higher, Roessel does not explicitly teach wherein the processor is configured to schedule the transmission of the one or more data elements more frequently than the transmission of the metadata.
Park, like Roessel, discloses a media streaming device sending metadata of a media storyline separately from the one or more data elements (“an apparatus for transmitting media data via streaming service” by “dividing the media data into a plurality of media data fractions,” generating “file formats including each of the media data fractions,” i.e., “a first file format and second file formats, wherein the first file format includes metadata of the entire media data for processing the media data fractions in the second file formats” and “transmitting the generated first and second file formats” – See [¶0017]), and encoding scheduling information indicating the scheduling configuration for the transmission (“the apparatus for transmitting broadcast signals can receive management information about the configuration of each stream constituting the input signal and generate a final physical layer signal with reference to the received management information” – See [¶0196] and Fig. 1).
Park also teaches wherein the processor is configured to schedule the transmission of the metadata in intervals (“conventional file format #1” contains “mandatory boxes [that] include information needed for processing file format #2,” i.e., contains metadata – See [¶1391]; “[t]his file format is transmitted at a specific interval or transmitted whenever media data is transmitted” – See [¶1420], whereby format #1 is similar to the semantic transcript/metadata in Roessel).
Park further teaches wherein the processor is configured to schedule the transmission of the one or more data elements more frequently than the transmission of the metadata (“include meta data . . . in file format 1 . . . correspond[ing] to the conventional file format #1” – See [¶1414] and “include segmented data in file format 2, file format 3, ... , file format N . . . correspond[ing] to the conventional file format #2” – See [¶1415]; “[t]he file formats may be sequentially transmitted in order of file format 1, file format 2, ... , file format N” – See [¶1416]; wherein “other file formats (file format #2) can be consumed only when a file format such as the file format #1 is received” – See [¶1420] and Fig. 143; therefore, the transmission of the one or more data elements/media segments in file formats #2-N occurs more frequently than the transmission of the metadata in file format #1).
Thus, Roessel and Park each discloses an apparatus generating one or more data elements and metadata for transmission based on a scheduling configuration for the transmission. A person of ordinary skill in the art before the effective filing date of the claimed invention would have understood that the independent metadata and media data encoding and transmission in Park, whereby the media data files are transmitted more often than the metadata files, could have been substituted in for the scheduled transmission in Roessel, because both provide encoding of metadata and media data for a receiving device. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution through techniques known in the art. Finally, the substitution achieves the predictable result of improving the apparatus in Roessel by receiving management information about the configuration of each stream constituting the input signal and generating the final physical layer signal with reference to the received management information.
Therefore, Claim 4 is rejected under 35 U.S.C. 103 as obvious over Roessel in view of Park.
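Note: solely to illustrate the combination above, Park's pattern of transmitting the metadata file (format #1) at intervals while media-segment files (formats #2-N) are transmitted more frequently can be sketched as follows. The interval value and function name are the examiner's hypothetical illustration:

    # Hypothetical sketch of Park's interleaving ([1414]-[1420]): the metadata
    # file format (#1) is emitted at a fixed interval, while the segmented
    # media data (formats #2..N) is emitted every cycle, i.e., more frequently.
    METADATA_INTERVAL = 4  # illustrative: resend metadata every 4 media segments

    def transmission_order(num_segments: int):
        order = []
        for i in range(num_segments):
            if i % METADATA_INTERVAL == 0:
                order.append("file_format_1 (metadata)")
            order.append(f"file_format_{i + 2} (media segment)")
        return order

    for item in transmission_order(6):
        print(item)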
Claims 8, 9, and 29 are rejected under 35 U.S.C. 102(a)(2) as anticipated by Roessel or, in the alternative, under 35 U.S.C. 103 as obvious over Roessel in view of Dai et al., "Semantic Coded Transmission: Architecture, Methodology, and Challenges," Source: arXiv:2112.03093v2, December 7, 2021 (available at https://doi.org/10.48550/arXiv.2112.03093) (hereinafter Dai).
Note: When a function is not explicitly disclosed by the reference, the examiner may make a rejection under both 35 U.S.C. 102 and 103. "There is nothing inconsistent in concurrent rejections for obviousness under 35 U.S.C. 103 and for anticipation under 35 U.S.C. 102." In re Best, 562 F.2d 1252, 1255 n.4 (CCPA 1977). See MPEP §2112(III). Here, Roessel teaches all the limitations of Claim 8 but for the machine learning model working on video frames (it discloses ML working on audio/sounds in the frames).
Regarding Claim 8, dependent from Amended Claim 1, Roessel further teaches the processor configured to:
extract the semantic information using a machine learning model configured to receive an input comprising the received data and provide an output comprising the extracted semantic information (“For the reference model 200, formal informational or media source content to be sent across a wireless link is subject to semantic source coding by eliminating semantic redundancy and to semantic channel coding by adding redundancy to reduce semantic noise” – See [¶0028] and Fig. 2, wherein “[s]emantic source coding includes extracting semantics from the source information (source "material", 202)” – See [¶0032] and “[c]orrectness metrics use ontology information 204 to determine how successful the semantic encoding/decoding have been” with “refer[ence] to factual information represented in the source media” – See [¶0029] and “consistency metrics measure object relationships in the media, details for the depicted scene (e.g., an environment or location), spatial-temporal information for the scene, and physical or action information for the scene)” – See [¶0030] whereby both semantic source coding and semantic channel coding are operations relying heavily on machine learning, as known in the art – See Dai infra),
wherein the machine learning model is further configured to provide the output comprising the metadata (e.g., “audio processing includes speech and sound recognition, and generating annotations to describe these sounds,” whereby “speech and sound recognition includes identifying a source of the sound, such as a particular person, a type of animal, a type of vehicle, and so forth. Machine learning models can be used to extract and annotate sounds,” i.e., output metadata – See [¶0054] and Fig. 3A showing that “[s]emantic fusion 316 includes generating a transcript 380 for all the semantic data to be encoded and transmitted to another device” including “ordering the annotations” and “consumer-specific rules 382 for transmitting data, annotating data, generating synthetic or enhanced data, and so forth” – See [¶0056]);
wherein the machine learning model is further configured to provide the output comprising a protection level or a priority level for the extracted semantic information (“The semantic metrics 204 include rules for rule-based encoding and decoding. These rules include measures for ethics, aesthetics, regulations, and other such rules related to personally identifiable information (PII) generated depictions of individuals in a synthetic video, and so forth” – See [¶0030]).
Therefore, Claim 8 is anticipated by Roessel. However, because Roessel describes only that “[m]achine learning models can be used to extract and annotate sounds” – See [¶0054] – and remains silent as to the machine learning model used, as known in the art, for video semantic processing, Dai is also applied. Dai, disclosing, like Roessel, a semantic coded transmission (SCT) system based on semantic source coding and semantic channel coding, specifically discloses a machine learning function extracting semantic information from video frames (“recent advances of semantic information processing techniques in natural language processing (NLP) and computer vision (CV) greatly allow us to integrate semantic aspects into communication systems to further improve the end-to-end transmission efficiency” – See §II, col.1, at page 2; and “most semantic information processing methods rely heavily on deep learning techniques” – See §III, col.1, at page 3, showing in Fig. 2 a “semantic analysis transform module that operates on the source data in some way to extract semantic features and produce semantically annotated messages,” wherein the source contains “(c) Image defined by three-dimensional multiplexed vectors; (d) Video realized by a series of correlated image frames in chronological order; (e) Various combinations above types, for example, multimedia data containing video and an associated audio channel” – See §III, col.1, at page 2, and wherein the transform module is a type of machine learning model based on artificial neural networks (ANN) whose “nonlinear properties can approximate arbitrary source distributions and extract the source semantic features,” e.g., “[i]n the SCT system, the ANN-based semantic analysis transform outputs the semantically annotated feature map in the latent space” – See §IV, col.2, at page 3).
Thus, Roessel and Dai each teaches extraction, encoding, and transmission of semantic information as a sequence of semantic feature/attribute vectors mapped onto different semantic objects in the source content. A person of ordinary skill in the art before the effective filing date of the claimed invention would have understood that the semantic coded transmission method based on extracting semantic information from video frames using ANN-based nonlinear transforms, as taught in Dai, could have been substituted in for the module performing semantic extraction from video frames of a media stream, as disclosed in Roessel, because both serve the purpose of extracting semantically encoded information from a media source. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution through techniques known in the art. Finally, the substitution achieves the predictable result of improving the end-to-end transmission efficiency, as taught in Dai.
Therefore, Claim 8 is anticipated by Roessel, or, in the alternative, obvious over Roessel in view of Dai.
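Note: solely to illustrate the alternative ground above, Dai's ANN-based semantic analysis transform (a nonlinear encoder producing a semantically annotated feature map in the latent space) can be sketched with a minimal numpy forward pass. The dimensions are arbitrary and the weights are randomly initialized; this is a sketch of the architecture only, not a trained model from any cited reference:

    import numpy as np

    # Hypothetical sketch of a semantic analysis transform: a nonlinear ANN
    # encoder mapping source data (a flattened video frame) to a latent
    # semantic feature vector. Dimensions and weights are illustrative.
    rng = np.random.default_rng(0)
    SRC_DIM, HIDDEN, LATENT = 1024, 256, 64
    W1 = rng.normal(0.0, 0.02, (SRC_DIM, HIDDEN))
    W2 = rng.normal(0.0, 0.02, (HIDDEN, LATENT))

    def semantic_analysis_transform(frame: np.ndarray) -> np.ndarray:
        """Extract a semantic feature vector (SFV) from one frame."""
        h = np.tanh(frame @ W1)  # nonlinearity approximates the source distribution
        return np.tanh(h @ W2)   # latent semantic feature vector

    sfv = semantic_analysis_transform(rng.normal(size=SRC_DIM))
    print(sfv.shape)  # (64,) -- one SFV per frame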
Regarding Claim 9, dependent from Claim 8, Roessel further teaches the device of claim 8, wherein the processor is configured to: provide application layer functions implemented at application layer according to a communication reference model (e.g., Fig. 5 depicts a communication reference model wherein layer 502 represents the application layer that “performs semantic extraction as described herein” – See [¶0066] including “generation and QoS annotation of the STS” – See [¶0037], i.e., application layer functions);
provide network access layer functions implemented at a lower layer that is lower than the application layer according to the communication reference model (also in Fig. 5, layer 506 represents a lower layer, i.e., “communication resources 130 [which] may include interconnection or network interface components or other suitable devices to communicate with one or more peripheral devices 104 or one or more databases 106 via a network 108” – See [¶0025] and Fig. 1, whereby each “communication link includes a physical channel that can introduce channel noise measured against link metrics 208 to the transmit signal” – See [¶0032]; and layer 504 further includes network access layer functions whereby “[b]ased on the determined QoS, a transmission level 504 is selected,” e.g., “for a low quality channel, the transmission level can include a basic STS mode” or “include a compressed video and STS mode” so that “a subject's face may be received from the video, but parts of the subject's body or clothing can be built by the receiver” and “[i]n a high quality mode of levels 504, the sender sends full compressed video” – See [¶0078]);
wherein the machine learning model is configured to receive the input from the application layer functions (e.g., a “Channel-Aware Semantic Coding (CASC)” enhancing “[t]he joint semantic source channel (JSSC)” coding – See [¶0035], which “includes selecting an amount of data to send (e.g., STS stream and video, video only, STS stream only, amount and frequency of optional sub-streams or actors captured, etc.) based on the quality of the channel that is experienced by the sender” after the operations of “semantic segmentation, object pose and activity recognition, speech and sound recognition, relationship inference and composition, semantic fusion and transcript composition, and compression of the semantic transcript” have been performed on the source material – See [¶0048]);
wherein the processor is configured to provide scheduling information to schedule the transmissions to the network access layer functions (the scheduling information, i.e., “STS element, such as stream type, stream level, full frame, delta frame, meta-frame, annotations, and anchors, are generated and transmitted depending on the QoS,” e.g., “QoS directives correspond to multiple, e.g. 8, 16, or more, levels of robustness demand, or forward error correction (FEC) demand, or modulation and coding scheme (MCS) demand” – See [¶0062], i.e., network access layer functions);
wherein the machine learning model is configured to receive a cross-layer information comprising at least a portion of the input from the application layer functions (the “STS is a flexible, hierarchical, and structured dataset that includes a formal, computer-readable format” with a “structural depth of . . . the STS data frame rate (e.g., an amount of semantic information present in the STS)” – See [¶0038], i.e., a portion of the input from the application layer functions, whereby “the quality and depth of semantic extraction to generate the STS is based on the predictions of channel quality” – See [¶0061], i.e., cross-layer information from the lower layers, e.g., network access functions, and “performs semantic fusion” which “includes generating a transcript 380 for all the semantic data to be encoded and transmitted to another device . . . determining consumer-specific rules 382 for transmitting data, annotating data, generating synthetic or enhanced data, and so forth” – See [¶0056] and Fig. 3A);
wherein the machine learning model is configured to provide a cross-layer information comprising the scheduling information to schedule the transmissions to the lower layer functions (the “semantic encoding 302 includes compressing the semantic transcript 380 and sending the transcript over a channel 306” – See [¶0057] and Fig. 3A, whereby “[t]he semantic coding module 210 includes . . . a joint source channel coding module 214. The joint semantic source channel (JSSC) coding refers to joint encoding and optimization of semantic source coding 216 and semantic channel coding 218” – See [¶0035] and Fig. 2, and includes “assigning (e.g., aggregating) STS elements to semantic source coding and channel coding (SSCC) streams” – See [¶0063] based on QoS determination/directive from the CASC – See [¶0062], i.e., the cross-layer information comprising the scheduling information to schedule the transmission when “the device transmits the STS stream over the channel based on the generated SSCC stream” – See [¶0066] and Figs. 4 and 5).
Because Roessel is silent on the machine learning nature of the joint semantic source channel (JSSC) coding as known in the art, Dai, using “biased coding [that] involves both source and channel coding such that the specific coding scheme of each SFV is in the light of its semantic importance” – See §IV, col.2, ¶1, at page 4, similar to Roessel, further teaches the joint semantic source-channel encoder using machine learning1 to provide cross-layer information comprising the scheduling information scheduled for a transmission over a physical semantic channel, i.e., a lower layer (using “the joint source-channel coding” (JSCC) that “maps the SFVs to the channel input symbols directly,” whereby “[t]he joint coding rate of each SFV should be aligned with their semantic importance” and the JSCC “can parameterize the encoder and decoder functions by two jointly trained ANNs” that “together constitute an autoencoder,” i.e., the ML model parameters/weights values, knowledge bases/ontologies, and the semantic importance of semantically encoded objects, i.e., the encoded scheduling information, are shared between the transmitter and the receiver, as shown in Fig. 2 and as known in the art2).
Thus, Roessel and Dai each teaches a JSCC technique applied to wireless communication of media. A person of ordinary skill in the art before the effective filing date of the claimed invention would have understood that the ANN-based joint semantic source-channel coding in Dai, whereby the encoded scheduling information, including the semantic importance of the extracted semantic objects, is sent over a semantic channel implemented by the autoencoder, could have been combined with the Channel-Aware Semantic Coding (CASC) adaptive generation and QoS annotation of the STS and creation of semantic source and channel coding SSCC streams in Roessel, because CASC naturally enhances the joint semantic source-channel encoding trained on network layer parameters, as taught by Dai. Furthermore, a person of ordinary skill in the art would have been able to carry out the combination through techniques known in the art. Finally, the combination achieves the predictable result of pairing the better SCT performance for low-resolution images with the classic image transmission encoders that perform better for high-resolution images due to the highly optimized source codes and ideal channel codes, as shown in Fig. 5 of Dai and taught by the combination of the JSCC in Roessel and Dai.
Therefore, Claim 9 is anticipated by Roessel or, in the alternative, obvious over Roessel in view of Dai.
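Note: solely to illustrate the alternative ground above, Dai's biased coding, in which the joint coding rate of each SFV is aligned with its semantic importance (§IV), can be sketched as follows. The importance scores, symbol budget, and function name are the examiner's hypothetical illustration:

    # Hypothetical sketch of importance-biased rate allocation: the
    # channel-symbol budget is divided across semantic feature vectors (SFVs)
    # in proportion to semantic importance, so more important SFVs receive
    # more channel symbols, i.e., stronger protection.

    def allocate_symbols(importance, total_symbols):
        total = sum(importance)
        return [max(1, round(total_symbols * w / total)) for w in importance]

    sfv_importance = [0.6, 0.25, 0.1, 0.05]  # e.g., key actor >> background texture
    print(allocate_symbols(sfv_importance, total_symbols=400))  # [240, 100, 40, 20]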
Regarding Claim 29, dependent from Claim 9, Roessel further teaches the device of claim 9, wherein the network access layer functions include a medium access control (MAC) layer function or a physical (PHY) layer function (“a monitoring and prediction component” at “[t]he sending device is configured to monitor events such as call drops, tracking frequency and size of upload (UL) grants, tracking measurements for handovers, reference symbols, and so forth” – See [¶0060] and the network scheduler 416 in Fig. 4 executing MAC and PHY layer functions).
Therefore, Claim 29 is anticipated by Roessel or, in the alternative, obvious over Roessel in view of Dai.
In sum, Claims 8, 9, and 29 are rejected under 35 U.S.C. §102(a)(2) as anticipated by Roessel or, in the alternative, under 35 U.S.C. §103 as obvious over Roessel in view of Dai.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Yang et al., U.S. Patent Application Publication No. 2020/0035215 discloses a semantic extraction apparatus and method based on machine learning and encoding of the extracted semantic information for transmission over a wireless communication system;
Roessel et al., U.S. Patent Application Publication No. 2024/0121453 teaches the same as Roessel above and in addition determining an expected power consumption for encoding video data that includes semantic features;
Lee et al., U.S. Patent Application Publication No. 2025/0015929 discloses a semantic-based wireless communication system in the Shannon-Weaver framework known in the art;
Lee et al., U.S. Patent Application Publication No. 2025/0232179 discloses a method for a transmitting end to transmit semantic data in a semantic wireless communication system wherein the semantic data is mapped to satisfy a required transmission power limitation condition;
Liroli et al., European Patent Application Publication No. EP 1619839 A1 discloses a "semantic-aware" technique located in a radio resource manager to detect the existence of critical conditions of the radio channel preventing the attainment of a given quality-of-service level and to selectively drop data units marked as less important. However, Liroli does not disclose the extraction of semantic information from the raw data and the encoding for transmission of such information;
Vaidya et al., U.S. Patent Application Publication No. 2012/0099497 teaches a frame format for exchanging information between the nodes regarding transmission based on network quality parameters;
Cristache, U.S. Patent Application Publication No. 2024/0359318 (and related patent from the same family) teaches a networking device and methods for semantic fluxes for management of semantic information within a network;
Shi et al., Chinese Patent Application Publication No. CN 112800247 teaches a semantic encoding/decoding method based on knowledge graph sharing, equipment and a communication system, and belongs to the field of wireless communication;
Masaki et al., Japanese Patent Application Publication No. JP 2005-322173 discloses a metadata generating device and semantic information extraction means;
Shi et al., “A new communication paradigm: from bit accuracy to semantic fidelity,” January 2021, available at https://arxiv.org/abs/2101.12649, discloses how to deploy semantics to solve the spectrum and power bottleneck and proposes a first understanding and then transmission framework with high semantic fidelity;
3GPP TR 23.700-91 V17.0.0 (2020-12) “Technical Report Technical Specification Group Services and System Aspects; Study on enablers for network automation for the 5G System (5GS); Phase 2 (Release 17)” published December 17, 2020, discloses integration of AI/ML with network functions in the case of Network Data Analytics Function and functional split in the 5G system reference architecture;
3GPP TS 23.501 V16.7.0 (2020-12), "Technical Specification Group Services and System Aspects; System architecture for the 5G System (5GS); Stage 2 (Release 16)," published December, 17, 2020;
Popovski et al., “Semantic-Effectiveness Filtering and Control for Post-5G Wireless Connectivity,” July 2019, available at https://arxiv.org/abs/1907.02441 teaches an Augmented Protocol Architecture comprising a semantic-effectiveness (SE) plane covering the functionalities concerning both semantic, i.e., the meaning conveyed by the transmitted symbols should accurately reflect the intentions of the sender, e.g., the application layer, and communication effectiveness problems;
Kountouris et al., “Semantics-Empowered Communication for Networked Intelligent Systems,” March 2021, available at https://arxiv.org/abs/2007.11579, describes a communication paradigm shift, which makes the Semantics of Information, i.e., the significance and the usefulness of messages with respect to the goal of data exchange, the underpinning of the entire communication process;
Bourtsoulatze et al., "Deep Joint Source-channel Coding for Wireless Image Transmission," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 4774-4778, doi: 10.1109/ICASSP.2019.8683463;
Liu et al., “Data-Importance Aware User Scheduling for Communication-Efficient Edge Machine Learning,” IEEE Transactions On Cognitive Communications And Networking, Vol. 7, No. 1, March 2021, pp.265-278;
Zhou et al., “Semantic Communication with Adaptive Universal Transformer,” 29 Nov 2021, available at arXiv:2108.09119v3;
Lan et al., “What Is Semantic Communication? A View on Conveying Meaning in the Era of Machine Intelligence,” Review Paper, Journal of Communications and Information Networks, Vol.6, No.4, Dec. 2021.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LUCIA GHEORGHE GRADINARIU whose telephone number is (571)272-1377. The examiner can normally be reached Monday-Friday 9:00am - 5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Joseph AVELLINO can be reached at (571)272-3905. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/L.G.G./ Examiner, Art Unit 2478
/KODZOVI ACOLATSE/ Primary Examiner, Art Unit 2478
1 Note that Dai describes a machine learning/ANN model used in the semantic analysis transform module that operates on the source data to extract semantic objects, as well as machine learning/ANN-based joint source-channel coding; Dai’s implementation is within the meaning of the distributed machine learning disclosed in the present Specification, comprising “a first portion of a distributed machine learning model . . . configured to obtain a semantic information based on received data” and “a second portion of the distributed machine learning model . . . trained based on one or more communication medium parameters with respect to the communication medium” – See Spec. [¶0154].
2 The modelling of end-to-end communication systems using the autoencoder architecture based on joint source and channel coding (JSCC), as applied to wireless image transmission, is further explained in Fig. 1(b) of Bourtsoulatze et al., "Deep Joint Source-channel Coding for Wireless Image Transmission," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 4774-4778, doi: 10.1109/ICASSP.2019.8683463.