Last updated: May 29, 2026
Application No. 18/168,891
VIRTUAL REFERENCE FRAMES FOR IMAGE ENCODING AND DECODING

Final Rejection §102 §103
Filed: Feb 14, 2023
Examiner: TRAN, THAI Q
Art Unit: 2484
Tech Center: 2400 (Computer Networks)
Assignee: Qualcomm Incorporated
OA Round: 3 (Final)
Interview Optional

Interview lift (-4.1%) is below the 15.0% threshold. A written response is recommended.
Based on 37 resolved cases, 2023–2026
Examiner Intelligence

TRAN, THAI Q
Career Allowance Rate: 27% (10 granted / 37 resolved), -31.0% vs TC avg
Interview Lift: -4.1% (minimal)
Avg Prosecution (typical timeline): 4y 9m
6 currently pending
Statute-Specific Performance

§103: 82.5% (+42.5% vs TC avg)
§102: 12.6% (-27.4% vs TC avg)
§112: 1.9% (-38.1% vs TC avg)
Based on career data from 37 resolved cases
Office Action

§102 §103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed Oct. 01, 2025 have been fully considered but they are not persuasive.
In re pages 7-8, applicant respectfully submits that the cited portions of Jiang do not disclose or suggest generating "a virtual reference frame based on synthesis support data included in the bitstream" and/or generating "a decoded version of the image frame based on the virtual reference frame," as in claim 1. 
As a threshold matter, Applicant respectfully notes that the cited portions of Jiang appear to only be applicable to an encoding process, not a decoding process. (see Office Action, p. 3, citing Jiang, [0065, 0066, 0091, 0092].) In an effort to best address the Office's arguments, Applicant assumes that the Office intends for the cited portions of Jiang to apply equally to both the encoding and decoding processes. Additionally, the video encoding and decoding in Jiang is based on "determining a set of facial landmark features of the at least one face from the at least one frame of the video data, and coding the video data at least partly by a neural network based on the determined set of facial landmark features." (Jiang, Abstract.) Although Jiang states that a neural network can be used to code video data based on facial landmark features, there is no indication in the cited portions of Jiang of any teaching or suggestion of generating "a virtual reference frame based on synthesis support data included in the bitstream" and/or generating "a decoded version of the image frame based on the virtual reference frame," as in claim 1. 
The Office argues that Jiang discloses generating a virtual reference frame, citing Jiang's discussion of predictive pictures and the use of facial landmarks to generate those predictive pictures. (Office Action, p. 3, citing Jiang, [0065, 0066, 0091, 0092].) However, nothing in the cited portions of Jiang teaches or suggests generating a virtual reference frame based on synthesis support data included in the bitstream. By Jiang's own terms, its reference frames are "previously-coded frames from the video sequence." (Jiang, [0057].) Jiang then describes using those reference frames as "prediction reference(s)" when encoding facial landmark data along with the video frames. (Id.) Jiang takes a different approach from that of claim 1: Jiang generates traditional reference frames and communicates those traditional reference frames along with "predictive pictures" and facial landmark data as part of an encoded bitstream. Claim 1 uses synthesis support data, which can include facial landmark data, to generate virtual reference frames. Thus, the cited portions of Jiang fail to disclose generating a virtual reference frame based on synthesis support data included in the bitstream, as in claim 1.
Applicant also respectfully notes that the Office offers no argument that Jiang discloses generating a decoded version of the image frame based on the virtual reference frame, as in claim 1. (see Office Action, pp. 2 and 3.) For at least these reasons, the cited portions of Jiang do not disclose each and every element of claim 1. Applicant therefore respectfully requests reconsideration and allowance of claim 1. 
For analogous reasons, the cited portions of Jiang do not disclose each and every element of independent claims 13, 14, and 30, which each recite elements similar to those discussed above. Applicant therefore respectfully requests reconsideration and allowance of independent claims 13, 14, and 30. Claims 3-10, 12, and 16-28 are also allowable at least because each depends from an allowable claim and each recites additional patentable subject matter. Applicant therefore respectfully requests reconsideration and allowance of claims 3-10, 12, and 16-28.
In response, the examiner respectfully disagrees. It is noted that Fig. 12 of JIANG et al. shows both an encoder and a decoder. Page 8, paragraph #0092 was cited in the rejection of claim 1 to show “a decoded version of the image frame based on the virtual reference frame” (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …"). Thus, both the encoding and the decoding of JIANG et al. were cited in the rejection of claim 1.
As discussed in the last Office Action, the P frame and B frame are encoded based on facial landmarks and motion vectors disclosed in paragraphs #0065, #0066, #0091, and #0092 of JIANG et al. It is further noted that the claimed “a virtual reference frame based on synthesis support data included in the bitstream” is anticipated by at least the P frame and B frame of JIANG et al., which are encoded based on facial landmarks and motion vectors. Thus, JIANG et al. does disclose all the claimed limitations of claim 1.
In re page 8, applicant states that the remaining claims are allowable for the same reasons as discussed in claim 1 above.
In response, as discussed above with respect to claim 1, JIANG et al. discloses all claimed limitations of claim 1.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3-10, 12-14, 16-28, and 30 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by JIANG et al. (US 2022/0217371 A1) as set forth in the last Office Action.
Regarding claim 1, JIANG et al. discloses a device (Fig. 13) comprising: one or more processors (see page 1, paragraph #0006, "According to exemplary embodiments, there is included a method and apparatus comprising memory configured to store computer program code and a processor or processors configured to access …") configured to: obtain a bitstream corresponding to an encoded version of an image frame (see page 8, paragraph #0091, "On the decoder side, such as described for example with respect to the flowchart 1100 of FIG. 11 and various modules of FIG. 12, received encoded bitstreams, at S111, ..."); based on determining that the bitstream includes a virtual reference frame usage indicator, generate a virtual reference frame based on synthesis support data included in the bitstream, wherein the synthesis support data includes facial landmark data, motion-based data, or both the facial landmark data and the motion-based data, and wherein the facial landmark data indicates locations of facial features included in the image frame (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block", page 7, paragraph #0088, "According to exemplary embodiments, to locate a pre-determined set of facial landmarks for each detected face (e.g., landmarks around left/right eyes, nose, mouth, etc.). …", page 8, paragraph #0091, "On the decoder side, and the decoded facial landmark features F.sub.l,1, F.sub.l,2, …, data 128. are further generated by using traditional motion interpolation or DNN-based frame synthesis methods based on x.sub.ki and x.sub.(k+1)i", and page 10, paragraph #0107, "Computer system 1400 …; virtual-reality glasses (not depicted), holographic displays and smoke tanks (not depicted)), and printers (not depicted)"); and generate a decoded version of the image frame based on the virtual reference frame (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …" and page 8, paragraph #0094, "Also, according to exemplary embodiments, …, a fusion module, a compute adversarial loss module 241, a compute reconstruction loss module 242, a compute perceptual loss module 243, and the workflow 1300 also includes various data 221, 224, 225, 229, 228, 232, 233, 236, 238, and 240").
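For orientation, the decoder-side flow recited in claim 1 can be sketched roughly as follows. This is only an illustrative sketch, not the applicant's implementation and not the method of JIANG et al.: the names `Bitstream`, `use_virtual_reference`, `synthesize_virtual_reference`, and `decode_frame` are hypothetical, frames are stand-in Python objects, and the use of a previously decoded frame as the synthesis input is an assumption borrowed from claim 7 rather than a requirement of claim 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bitstream:
    """Hypothetical container; field names are assumptions for illustration."""
    payload: bytes
    # Stands in for the claimed "virtual reference frame usage indicator".
    use_virtual_reference: bool = False
    # Stands in for the claimed "synthesis support data",
    # e.g. {"landmarks": [(x, y), ...], "motion": (dx, dy)}.
    synthesis_support_data: Optional[dict] = None


def synthesize_virtual_reference(previous_frame, support):
    """Placeholder synthesis: records which inputs produced the reference."""
    return {"base": previous_frame, "support": support, "virtual": True}


def decode_frame(bitstream, previous_frame):
    """Return (decoded_frame, reference_used) for one frame."""
    if bitstream.use_virtual_reference and bitstream.synthesis_support_data is not None:
        # Indicator present: build the reference from data carried in the bitstream.
        reference = synthesize_virtual_reference(
            previous_frame, bitstream.synthesis_support_data
        )
    else:
        # No indicator: fall back to an ordinary previously decoded frame.
        reference = {"base": previous_frame, "support": None, "virtual": False}
    # Placeholder "decoding": pair the reference with the coded payload.
    decoded = {"reference": reference, "residual": bitstream.payload}
    return decoded, reference
```

The point of the sketch is the ordering the claim recites: the indicator is checked first, the reference is then generated from data carried in the bitstream, and only then is the frame decoded against that reference.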
Regarding claim 3, JIANG et al. also discloses wherein the bitstream indicates a first set of reference candidates that includes the virtual reference frame (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
Regarding claim 4, JIANG et al. further discloses wherein the bitstream indicates one or more additional first sets of reference candidates that include one or more additional virtual reference frames associated with one or more additional image frames of a sequence of image frames (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
 Regarding claim 5, JIANG et al. discloses wherein the bitstream further indicates a second set of reference candidates including one or more previously decoded image frames (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block").  
Regarding claim 6, JIANG et al. discloses wherein the bitstream includes a supplemental enhancement information (SEI) message indicating the synthesis support data (see page 3, paragraph #0038, "The video decoder 300 … The control information for the rendering device(s) may be in the form of Supplementary Enhancement Information (SEI messages) or Video Usability Information parameter set fragments (not depicted). .."). 
 Regarding claim 7, JIANG et al. discloses wherein the synthesis support data includes the facial landmark data indicating locations of facial features, and wherein the one or more processors are configured to generate the virtual reference frame based at least in part on a previously decoded image frame and the locations of facial features (see page 5, paragraph #0066, "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block", page 7, paragraph #0088, " According to exemplary embodiments, to locate a pre-determined set of facial landmarks for each detected face (e.g., landmarks around left/right eyes, nose, mouse, etc.). ...", and page 8, paragraph #0091, "On the decoder side, and the decoded facial landmark features F.sub.l,1, F.sub.l,2, …, data 128. …"). 
Regarding claim 8, JIANG et al. discloses wherein the synthesis support data includes the motion-based data indicating global motion, and wherein the one or more processors are configured to generate the virtual reference frame based at least in part on a previously decoded image frame and the global motion (page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block"). 
 Regarding claim 9, JIANG et al. discloses wherein the one or more processors are configured to use the motion-based data to warp a previously decoded image frame to generate the virtual reference frame, wherein the synthesis support data includes the motion-based data (page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block").  
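To make the "warp a previously decoded image frame" language of claim 9 concrete, the following minimal sketch warps a frame by a single global translation. It assumes, purely for illustration, that the motion-based data reduces to one integer displacement `(dx, dy)` and that a frame is a list of pixel rows; real codecs use richer motion models, and neither the application nor JIANG et al. is the source of the function name `warp_global_translation`.

```python
def warp_global_translation(frame, dx, dy, fill=0):
    """Shift every pixel of `frame` by (dx, dy); uncovered pixels get `fill`.

    `frame` is a list of equally long rows. Positive dx moves content right,
    positive dy moves content down. This is a hypothetical helper.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    warped = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Look up where this output pixel came from in the source frame.
            src_x, src_y = x - dx, y - dy
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = frame[src_y][src_x]
    return warped
```

For example, shifting the 2x2 frame `[[1, 2], [3, 4]]` one pixel to the right with `warp_global_translation(frame, 1, 0)` gives `[[0, 1], [0, 3]]`: the left column is newly exposed and filled, and the original left column moves to the right.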
Regarding claim 10, JIANG et al. discloses wherein the one or more processors are configured to use a trained model to generate the virtual reference frame, and wherein an input to the trained model includes the synthesis support data and at least one previously decoded image frame (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block" and page 8, paragraph #0094, "Also, according to exemplary embodiments, there are several components in the proposed framework that needs to be trained, and such training will be described with respect to FIG. 13 which illustrates a workflow 1300 of an exemplary training process according to exemplary embodiments. …"). 
 Regarding claim 12, JIANG et al. discloses further comprising a display device configured to display the decoded version of the image frame (see page 2, paragraph #0032, "FIG. 1 illustrates a second pair of terminals 101 and 104 provided to support bidirectional transmission of coded video that may occur, for example, during videoconferencing. may decode the coded data and may display the recovered video data at a local display device."). 
Method claim 13 is rejected for the same reasons as discussed for the corresponding apparatus claim 1 above.
Regarding claim 14, JIANG et al. discloses a device (Fig. 12) comprising: one or more processors (see page 1, paragraph #0006, "According to exemplary embodiments, there is included a method and apparatus comprising memory configured to store computer program code and a processor or processors configured to access ...") configured to: obtain synthesis support data associated with an image frame of a sequence of image frames, wherein the synthesis support data includes facial landmark data, motion-based data, or both the facial landmark data and the motion-based data, and wherein the facial landmark data indicates locations of facial features included in the image frame (see page 5, paragraph #0066, "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block", page 7, paragraph #0088, "According to exemplary embodiments, The Face Detection & Facial Landmark Extraction module 122 can use any face detector to locate face areas in each video frame xi, such as to locate a pre-determined set of facial landmarks for each detected face (e.g., landmarks around left/right eyes, nose, mouth, etc.). …", and page 10, paragraph #0107, "Computer system 1400 …; virtual-reality glasses (not depicted), holographic displays and smoke tanks (not depicted)), and printers (not depicted)"); selectively generate a virtual reference frame based on the synthesis support data (see page 5, paragraph #0066, "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block", page 7, paragraph #0088, "According to exemplary embodiments, The Face Detection & Facial Landmark Extraction module 122 can use any face detector to locate face areas in each video frame xi, such as to locate a pre-determined set of facial landmarks for each detected face (e.g., landmarks around left/right eyes, nose, mouth, etc.). …", and page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …"); and generate a bitstream corresponding to an encoded version of the image frame that is at least partially based on the virtual reference frame (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
Regarding claim 16, JIANG et al. discloses wherein the bitstream includes the synthesis support data (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
Regarding claim 17, JIANG et al. discloses wherein the one or more processors are configured to generate a first set of reference candidates that includes the virtual reference frame (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
Regarding claim 18, JIANG et al. discloses wherein the bitstream indicates the first set of reference candidates (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
Regarding claim 19, JIANG et al. discloses wherein the one or more processors are configured to generate one or more additional first sets of reference candidates that include one or more additional virtual reference frames associated with one or more additional image frames of the sequence of image frames (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
Regarding claim 20, JIANG et al. discloses wherein the bitstream further indicates a second set of reference candidates including one or more previously decoded image frames, and wherein the one or more processors are configured to generate the virtual reference frame based at least in part on determining that a count of reference frames in the second set of reference candidates is less than a threshold reference count of a coding configuration (see page 5, paragraph #0066, "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block"). 
 Regarding claim 21, JIANG et al. discloses wherein the one or more processors are configured to, based at least in part on detecting a face in the image frame, generate the virtual reference frame (see page 7, paragraph #0088, " According to exemplary embodiments, The Face Detection & Facial Landmark Extraction module 122 can use any face detector to locate face areas in each video frame xi, such as to locate a pre-determined set of facial landmarks for each detected face (e.g., landmarks around left/right eyes, nose, mouse, etc.). …").  
Regarding claim 22, JIANG et al. discloses wherein the one or more processors are configured to: obtain the motion-based data associated with the image frame (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block"); and based at least in part on determining that the motion-based data indicates global motion that is greater than a global motion threshold (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block"), generate the virtual reference frame (see page 8, paragraph #0092, "At S114, the decoded EFA features F.sub.b,1, F.sub.b,2, …, and the up-sampled sequence X=x.sub.1, x.sub.2, … are aggregated together by a Fusion module 139 to generate the final reconstructed video sequence X={circumflex over (x)}.sub.1, {circumflex over (x)}.sub.2, …, data 140. The Fusion module can be a small DNN, where for generating {circumflex over (x)}.sub.i at time stamp i, the Fusion module can use only L.sub.i, B.sub.i, and x.sub.i from the same time stamp, or use L.sub.i-n, …, L.sub.i+m, B.sub.i-n, …, B.sub.i+m, and use x.sub.i-n, …, x.sub.i+1 from a few neighbouring time stamps. …").
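The encoder-side conditions recited in claims 20 through 22 (too few ordinary reference frames, a detected face, or large global motion) can be pictured as a simple gating check. The sketch below is an assumption-laden illustration only: the function name `should_generate_virtual_reference`, its parameters, and the choice to combine the three conditions with a logical OR are all hypothetical, since each dependent claim recites its condition separately and only "at least in part".

```python
def should_generate_virtual_reference(
    num_decoded_refs,
    ref_count_threshold,
    face_detected,
    global_motion,
    global_motion_threshold,
):
    """Return True if any illustrative trigger for a virtual reference frame fires.

    All parameter names are hypothetical. Combining the triggers with OR is an
    assumption made for this sketch, not something the claims require.
    """
    # Claim 20 style trigger: fewer previously decoded references than the
    # coding configuration's threshold reference count.
    too_few_refs = num_decoded_refs < ref_count_threshold
    # Claim 22 style trigger: global motion exceeds the global motion threshold.
    large_motion = global_motion > global_motion_threshold
    # Claim 21 style trigger: a face was detected in the image frame.
    return too_few_refs or face_detected or large_motion
```

With a threshold reference count of 2 and a global motion threshold of 5.0, a frame with four decoded references, no face, and no motion would not trigger synthesis, while the same frame with a detected face would.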
Regarding claim 23, JIANG et al. discloses wherein the synthesis support data includes facial landmark data that indicates locations of facial features in the image frame (see page 7, paragraph #0088, " According to exemplary embodiments, The Face Detection & Facial Landmark Extraction module 122 can use any face detector to locate face areas in each video frame xi, such as to locate a pre-determined set of facial landmarks for each detected face (e.g., landmarks around left/right eyes, nose, mouse, etc.). …"). 
Regarding claim 24, JIANG et al. discloses wherein the synthesis support data includes motion sensor data indicating motion of an image capture device associated with the image frame (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block"). 
Regarding claim 25, JIANG et al. discloses wherein the image capture device includes at least one of an extended reality (XR) device, a vehicle, or a camera (see page 4, paragraph #0052, "The video source 401 may provide the source the video source 401 may be a camera that captures local image information as a video sequence. …"). 
 Regarding claim 26, JIANG et al. discloses wherein the one or more processors are configured to use the motion-based data to warp a previously decoded image frame to generate the virtual reference frame, wherein the synthesis support data includes the motion-based data (see page 5, paragraphs #0065-#0066, "A Predictive picture (P picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block" and "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block"). 
Regarding claim 27, JIANG et al. discloses wherein the bitstream includes a supplemental enhancement information (SEI) message indicating virtual reference frame usage to generate a decoded version of the image frame (see page 3, paragraph #0038, "The video decoder 300 … The control information for the rendering device(s) may be in the form of Supplementary Enhancement Information (SEI messages) or Video Usability Information parameter set fragments (not depicted). "). 
 Regarding claim 28, JIANG et al. discloses wherein the one or more processors are configured to use a trained model to generate the virtual reference frame, and wherein input to the trained model includes the synthesis support data and at least one previously decoded image frame (see page 5, paragraph #0066, "A Bi-directionally Predictive Picture (B Picture) may be one that may be coded and decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple-predictive pictures can use more than two reference pictures and associated metadata for the reconstruction of a single block" and page 8, paragraph #0094, "Also, according to exemplary embodiments, there are several components in the proposed framework that needs to be trained, and such training will be described with respect to FIG. 13 which illustrates a workflow 1300 of an exemplary training process according to exemplary embodiments. …"). 
Method claim 30 is rejected for the same reasons as discussed for the corresponding apparatus claim 14 above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 11 and 29 are rejected under 35 U.S.C. 103 as being unpatentable over JIANG et al. (US 2022/0217371 A1) as set forth in the last Office Action. 
Regarding claim 11, JIANG et al. discloses all the claimed limitations as discussed in claim 1 above, including that the terminals 101, 102, 103, and 104 are connected to each other using wireline and/or wireless communication networks (see page 2, paragraph #0033, "In FIG. 1, the terminals 101, 102, 103 and 104 may be illustrated as servers, personal computers and smart phones but the principles of the present disclosure are not so limited. … The network 105 represents any number of networks that convey coded video data among the terminals 101, 102, 103 and 104, including for example wireline and/or wireless communication networks." ...), except for providing a modem configured to transmit the bitstream to a second device.
The use of a modem to transmit data between terminals is old and well known in the art, and Official Notice is taken.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a well-known modem to transmit data between the terminals 101, 102, 103, and 104 of JIANG et al., since it merely amounts to selecting equivalent available transmitters between terminals.
Claim 29 is rejected for the same reasons as discussed for claim 11 above.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
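
As a rough illustration of the reply windows described in the preceding paragraph, the following Python sketch computes the two-month, three-month, and six-month dates from a mailing date. The mailing date of Apr 13, 2026 is taken from the prosecution timeline on this page. The helper names add_months and reply_windows are hypothetical, and the sketch assumes plain calendar-month arithmetic with no adjustment for weekends or federal holidays, so the results should be checked against the mailing date printed on the action itself.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def reply_windows(mailing_date: date) -> dict:
    """Hypothetical helper: key dates counted from the mailing date of a final action."""
    return {
        # A first reply filed by this date triggers the advisory-action rule.
        "two_month_first_reply": add_months(mailing_date, 2),
        # End of the shortened statutory period.
        "shortened_statutory_period": add_months(mailing_date, 3),
        # The statutory period for reply never extends past this date.
        "six_month_limit": add_months(mailing_date, 6),
    }


# Assumption: mailing date as listed in the prosecution timeline.
windows = reply_windows(date(2026, 4, 13))
for label, when in windows.items():
    print(f"{label}: {when.isoformat()}")
```

The sketch does not model the case where the advisory action is mailed after the three-month date; in that case the period runs to the advisory action's mailing date, still capped by the six-month limit.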

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THAI Q TRAN whose telephone number is (571)272-7382. The examiner can normally be reached Monday to Friday from 10:00 am to 6:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Colleen Fauz, can be reached at (571) 272-1667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/THAI Q TRAN/
Supervisory Patent Examiner, Art Unit 2484
Prosecution Timeline

Feb 14, 2023: Application Filed
Feb 25, 2025: Non-Final Rejection mailed (§102, §103)
May 23, 2025: Response Filed
Jul 01, 2025: Non-Final Rejection mailed (§102, §103)
Oct 01, 2025: Response Filed
Apr 13, 2026: Final Rejection mailed (§102, §103)
May 26, 2026: Interview Requested
Precedent Cases

Applications granted by this same examiner with similar technology

18/418,641
Patent 12641555
CLOCK SYNCHRONIZATION METHOD AND COMMUNICATION APPARATUS
2y 4m to grant Granted May 26, 2026
18/672,408
Patent 12625032
IMAGE-BASED BEARING FAILURE DETECTION
1y 11m to grant Granted May 12, 2026
18/931,088
Patent 12603984
DENSE-VIEWPOINT THREE-DIMENSIONAL DISPLAY SYSTEM WITH DISCRETELY-ARRANGED EYEBOXES AND DISPLAY METHOD THEREOF
1y 5m to grant Granted Apr 14, 2026
18/719,724
Patent 12568196
AUTOSTEREOSCOPIC DISPLAY DEVICE PRESENTING 3D-VIEW AND 3D-SOUND
1y 8m to grant Granted Mar 03, 2026
18/647,467
Patent 12563168
ENGINEERED CUT-OUTS FOR A DISPLAY BACK LIGHT UNIT
1y 10m to grant Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 4-5
Grant Probability: 27%
With Interview (-4.1%): 23%
Median Time to Grant: 4y 9m (~1y 6m remaining)
PTA Risk: High
Based on 37 resolved cases by this examiner. Grant probability derived from career allowance rate.
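
The projection figures can be reproduced with simple arithmetic, sketched below in Python. The counts (10 granted out of 37 resolved) and the -4.1 interview lift come from this page. Treating the lift as an additive adjustment in percentage points is an assumption, since the page does not say how the 23% figure is derived, and the variable names are illustrative only.

```python
# Figures reported on this page for the examiner.
granted = 10
resolved = 37
interview_lift_points = -4.1  # assumption: additive, in percentage points

# Career allowance rate, used by the page as the grant probability.
allowance_rate = 100 * granted / resolved

# Assumed derivation of the "With Interview" figure.
with_interview = allowance_rate + interview_lift_points

print(f"Grant probability: {allowance_rate:.1f}%")
print(f"With interview:    {with_interview:.1f}%")
```

Under that assumption the rounded values match the 27% and 23% shown above, which is consistent with the page's recommendation of a written response over an interview.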