Prosecution Insights
Last updated: May 29, 2026
Application No. 18/634,519

ARTIFICIAL INTELLIGENCE DEVICE FOR ROBUST MULTIMODAL ENCODER FOR PERSON REPRESENTATIONS AND CONTROL METHOD THEREOF

Non-Final OA §103
Filed: Apr 12, 2024
Priority: Apr 13, 2023 (provisional 63/459,257)
Examiner: COLUCCI, MICHAEL C
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: LG Electronics Inc.
OA Round: 2 (Non-Final)
Grant Probability: 76% (Favorable)
OA Rounds: 2-3
Est. Remaining: 1y 0m
With Interview: 91%

Examiner Intelligence

Grants 76% (above average)
Career Allowance Rate: 76% (758 granted / 999 resolved), +13.9% vs TC avg
Interview Lift: +15.2% (strong; resolved cases with interview)
Typical timeline: 3y 1m avg prosecution, 32 currently pending
Career history: 1033 total applications across all art units

Statute-Specific Performance

§101: 3.6% (-36.4% vs TC avg)
§103: 86.9% (+46.9% vs TC avg)
§102: 2.9% (-37.1% vs TC avg)
§112: 1.1% (-38.9% vs TC avg)
Based on career data from 999 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Response to Arguments

Applicant's arguments filed 12/23/2025 have been fully considered but they are not persuasive. On page 10 of 12 of the arguments Applicant argues that the amended limitation from cancelled claims 3 and 4 are not taught by the combination and lack a motivation to combine. Examiner does not concur, as the concept of summed losses for user authentication is present in both Qian and SARAF, where Qian teaches audio-visual embedding layers with loss summation and SARAF teaches audio embedding layers with loss summation with respect to age. The concept is analogous to the present invention's fig. 7, Qian teaches fig. 3c known as multi-modal fusion which explicitly comprises an end-to-end concept with fusion and multi-modal audio-visual embedding as in Section V. on page 1084, and further with multi-task learning which explicitly utilizes multiple losses for instance in an overall view of losses applied to fig. 3c, in other words, under BRI the claimed GE2E-MM loss is simply an end-to-end fusion multi-modal network with multi-loss calculations. See Section V. sub-sections A. and B.

While Qian teaches the audio-visual embedding component, SARAF teaches the audio component of embedded data as part of multi-modal system in which audio can be part of a multi-media input file and also suggested by SARAF 0031, and further as in 0009 with 0023 and fig. 4c there is clearly embeddings related to loss due to age as a totaled sum for increased accuracy of authentication. SARAF teaches a loss component is added among other loss components with respect to age as in 0009 with 0023 and fig. 4c, here the goal is to sum losses to produce more accurate training using multi-media embedding e.g. voice of a user suggested as part of audio-visual per se as in 0031.

Therefore, the motivation for such a combination amounts to using the existing audio-visual authentication with loss meshed with the audio authentication with age loss of SARAF to differentiate between a live, aging human and a static photo or synthetic deepfake, as the system can detect if the age characteristics match a previous profile or expected age range, wherein by adding a second constraint (age) to the biometric match (identity), the likelihood of a successful fraudulent login is reduced, as a hacker would need to match both the voice/face AND the estimated age.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 6, 7, 9-12, 16, 17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Qian et al., Audio-Visual Deep Neural Network for Robust Person Verification, Feb 8th 2021, IEEE Vol 29, 1079-1092 (hereinafter Qian) in view of US 20230107741 A1 SARAF; Amruta et al. (hereinafter SARAF).

Re claim 1, Qian teaches

1. A method for controlling an artificial intelligence (AI) device, the method comprising: (Abstract e.g. AV DNN)

obtaining, via a processor in the AI device, a video sample of a user and an audio sample of the user; (Audio + video as in 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

generating, via a neural network, a visual embedding based on the video sample and an audio embedding based on the audio sample, the visual embedding and the audio embedding being multi-dimensional vectors; (Audio + video embeddings as in 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

generating, via the neural network, an audio-visual embedding based on a combination of the visual embedding and the audio embedding; (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

determining, via the processor, a specific pre-enrolled audio-visual embedding from among a plurality of pre-enrolled audio-visual embeddings corresponding pre-enrolled users based on a distance away from the audio-visual embedding within a joint audio-visual subspace; and (Any user face/audio data known beforehand is analogous to pre-enrolled per se or pre-known, utilizing distances for loss on pair values 1085 sec. A…using audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

…wherein the neural network is trained based on a loss function that uses a plurality of audio-visual embeddings, each of the plurality of audio-visual embeddings includes an audio component and a visual component. (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

and wherein the loss function is based on a combined multi-task loss that includes a generalized end-to-end multi-modal (GE2E-MM) loss based on the audio component and the visual component… (analogous to the present invention's fig. 7, Qian teaches fig. 3c known as multi-modal fusion which explicitly comprises an end-to-end concept with fusion and multi-modal audio-visual embedding as in Section V. on page 1084, and further with multi-task learning which explicitly utilizes multiple losses for instance in an overall view of losses applied to fig. 3c, in other words, under BRI the claimed GE2E-MM loss is simply an end-to-end fusion multi-modal network with multi-loss calculations See Section V. 1085 sub-sections A. and B. …supplemented with e2e multi-modal 1080 col 1 with Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

However, while Qian teaches audio-visual embedding, pre-existing user data and person verification/authentication for A/V matching, it does not necessarily fully teach authentication in the context of having a user profile or account per se for enrollment and verification purposes, thus failing to teach at least in part:

verifying, via the processor, the user as the specific pre-enrolled user corresponding to the specific pre-enrolled audio-visual embedding (while Qian teaches the audio-visual embedding component, SARAF teaches the audio component of embedded data as part of multi-modal system in which audio can be part of a multi-media input file and also suggested by SARAF 0031, and further as in 0009 with 0023 and fig. 4c there is clearly embeddings related to loss due to age as a totaled sum for increased accuracy of authentication, already enrolled users are verified with various user data such as age, etc. utilizing loss 0007-0009, where a user has a personalized or custom profile/account which is compared to input data based on a scoring distance 0080-0081, in various modalities such as e2e with loss using aforementioned age and other parameters 0084 with fig. 4a-4c)

…added to an auxiliary task loss corresponding to an age component (SARAF teaches a loss component is added among other loss components with respect to age as in 0009 with 0023 and fig. 4c, here the goal is to sum losses to produce more accurate training using multi-media embedding e.g. voice of a user suggested as part of audio-visual per se as in 0031)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Qian to incorporate the above claim limitations as taught by SARAF to allow for combining prior art elements such as verification of users with another loss element using audio + visual using known user information prior or in a historical time-line according to known methods such as using enrolled users with a profile to yield predictable results, thereby using the existing audio-visual authentication with loss meshed with the audio authentication with age loss of SARAF to differentiate between a live, aging human and a static photo or synthetic deepfake, as the system can detect if the age characteristics match a previous profile or expected age range, wherein by adding a second constraint (age) to the biometric match (identity), the likelihood of a successful fraudulent login is reduced, as a hacker would need to match both the voice/face AND the estimated age, and utilizing explicit user profile or accounts with customizable data stored per user rather than in a cache per se, using biometrics and demographic information within a signal frequency/time as well as comparison thereof, with a user account.

Re claim 11, this claim has been rejected for teaching a broader, or narrower claim based on general inclusion of hardware alone (e.g. processor, memory, instructions), representation of claim 1 omitting/including hardware for instance, otherwise amounting to a virtually identical scope.

Re claims 2 and 12, while Qian teaches pre-existing user data and person verification/authentication for A/V matching, it does not necessarily fully teach authentication in the context of having a user profile or account per se, linked to additional biometric or demographic data or user-specific custom features, for enrollment and verification purposes, thus failing to teach at least in part:

2. The method of claim 1, further comprising: outputting personalized content for the user based on the verifying the user as the specific pre-enrolled user. (SARAF already enrolled users are verified with various user data such as age, etc. utilizing loss 0007-0009, where a user has a personalized or custom profile/account which is compared to input data based on a scoring distance 0080-0081, in various modalities such as e2e with loss using aforementioned age and other parameters 0084 with fig. 4a-4c)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Qian to incorporate the above claim limitations as taught by SARAF to allow for combining prior art elements such as verification of users using known information prior or in a historical time-line according to known methods such as using enrolled users with a profile to yield predictable results, thereby utilizing explicit user profile or accounts with customizable data stored per user rather than in a cache per se, using biometrics and demographic information within a signal frequency/time as well as comparison thereof, with a user account.

Re claims 6 and 16,

6. The method of claim 1, wherein the video sample includes a face track cropping of the user including a plurality of frames, and wherein the audio sample includes a recording of a voice of the user.
(input of a user voice or recording at a point in time in frames e.g. a video to form image frames and segmented audio in general during analysis, Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

Re claims 7 and 17, Qian teaches

7. The method of claim 1, further comprising: training the neural network based on batching N x M audio and visual inputs to update weights of the neural network, where N and M correspond to unique speakers and unique audio-visual utterances for each of the unique speakers, respectively. (supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1… Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth)

Re claims 9 and 19, Qian teaches

9. The method of claim 1, further comprising:

transforming, via the neural network, the audio embedding into a projected audio embedding projected onto the joint audio-visual subspace; (as in fig. 3 c. transform layers…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)

transforming, via the neural network, the visual embedding into a projected visual embedding projected onto the joint audio-visual subspace; (as in fig. 3 c. transform layers…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)

multiplying, via the neural network, the projected audio embedding by a voice attention weight to generate a weighted audio embedding; (as in fig. 3 c. attention layer weight…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)

multiplying, via the neural network, the projected visual embedding by a face attention weight to generate a weighted visual embedding; and (as in fig. 3 c. attention layer weight…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)

summing, via the neural network, the weighted audio embedding and the weighted visual embedding to generate the audio-visual embedding corresponding to the user. (as in fig. 3 c. summation or combination to produce A/V embedding…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)

Re claim 10, Qian teaches

10. The method of claim 1, wherein the voice attention weight added to face attention weight equal 1. (with a single attention weight at varying values from 0 to xth supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1… Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth)

Re claim 20, Qian teaches

20. A method for controlling an artificial intelligence (AI) device, the method comprising: (fig. 6)

obtaining, via a processor in the AI device, a video sample of a user and an audio sample of the user; (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

generating, via a neural network, a visual embedding based on the video sample and an audio embedding based on the audio sample, the visual embedding and the audio embedding being multi-dimensional vectors; and (image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1, Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1.)
generating, via the neural network, an audio-visual embedding based on a combination of the visual embedding and the audio embedding, the audio-visual embedding being a biometric of the user, (Abstract biometrics, and Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

wherein the neural network is trained based on a loss function that uses a plurality of audio-visual embeddings, each of the plurality of audio-visual embeddings includes an audio component and a visual component. (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

and wherein the loss function is based on a combined multi-task loss that includes a generalized end-to-end multi-modal (GE2E-MM) loss based on the audio component and the visual component… (analogous to the present invention's fig. 7, Qian teaches fig. 3c known as multi-modal fusion which explicitly comprises an end-to-end concept with fusion and multi-modal audio-visual embedding as in Section V. on page 1084, and further with multi-task learning which explicitly utilizes multiple losses for instance in an overall view of losses applied to fig. 3c, in other words, under BRI the claimed GE2E-MM loss is simply an end-to-end fusion multi-modal network with multi-loss calculations See Section V. 1085 sub-sections A. and B. …supplemented with e2e multi-modal 1080 col 1 with Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

However, while Qian teaches audio-visual embedding, pre-existing user data and person verification/authentication for A/V matching, it does not necessarily fully teach authentication in the context of having a user profile or account per se for enrollment and verification purposes, thus failing to teach at least in part:

…added to an auxiliary task loss corresponding to an age component (SARAF teaches a loss component is added among other loss components with respect to age as in 0009 with 0023 and fig. 4c, here the goal is to sum losses to produce more accurate training using multi-media embedding e.g. voice of a user suggested as part of audio-visual per se as in 0031)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Qian to incorporate the above claim limitations as taught by SARAF to allow for combining prior art elements such as verification of users using known information prior or in a historical time-line according to known methods such as using enrolled users with a profile to yield predictable results, thereby utilizing explicit user profile or accounts with customizable data stored per user rather than in a cache per se, using biometrics and demographic information within a signal frequency/time as well as comparison thereof, with a user account.

Allowable Subject Matter

Claims 5 and 15 as well as 8 and 18, are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

US 11238843 B2 Arik; Sercan O. et al. User profiles for A/V

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL COLUCCI whose telephone number is (571)270-1847. The examiner can normally be reached on M-F 9 AM - 7 PM.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571)-270-1847
Examiner FAX: (571)-270-2847
Michael.Colucci@uspto.gov
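The fusion steps recited in claims 9 and 10 of the office action (project each embedding into a joint subspace, scale by attention weights that sum to 1, then sum) and the distance-based lookup of a pre-enrolled embedding in claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the softmax used to make the two weights sum to 1, the cosine distance, and the toy vectors. None of it is taken from the application, Qian, or SARAF.

```python
import math

def project(matrix, vec):
    """Linear projection into the joint subspace (matrix rows times vec)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attention_weights(score_audio, score_visual):
    """Two-way softmax (an assumed choice) so the two weights sum to 1."""
    m = max(score_audio, score_visual)
    ea, ev = math.exp(score_audio - m), math.exp(score_visual - m)
    return ea / (ea + ev), ev / (ea + ev)

def fuse(audio_emb, visual_emb, w_audio, w_visual, score_audio, score_visual):
    """Weighted sum of the projected audio and visual embeddings."""
    pa = project(w_audio, audio_emb)
    pv = project(w_visual, visual_emb)
    alpha_a, alpha_v = attention_weights(score_audio, score_visual)
    return [alpha_a * a + alpha_v * v for a, v in zip(pa, pv)]

def cosine_distance(u, v):
    """1 minus cosine similarity; smaller means closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_enrolled(query, enrolled):
    """Return the key of the enrolled embedding closest to the query."""
    return min(enrolled, key=lambda name: cosine_distance(query, enrolled[name]))

# Toy data: identity projections and equal attention scores (hypothetical).
identity = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse([1.0, 0.0], [0.0, 1.0], identity, identity, 0.0, 0.0)
enrolled = {"user_a": [1.0, 1.0], "user_b": [1.0, -1.0]}
match = nearest_enrolled(fused, enrolled)
print(fused, match)
```

A real verification step would also compare the best distance against a threshold before accepting the match; that is omitted here.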

Prosecution Timeline

Apr 12, 2024: Application Filed
Nov 05, 2025: Non-Final Rejection mailed (§103)
Dec 23, 2025: Response Filed
Feb 20, 2026: Final Rejection mailed (§103)
Mar 26, 2026: Response after Non-Final Action
May 20, 2026: Response after Non-Final Action
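The final action quoted above sets a three-month shortened statutory period and a six-month outer limit, both counted from the mailing date. As a rough sketch, assuming the Feb 20, 2026 timeline entry is that mailing date, the two dates can be computed as below. The helper `add_months` is hypothetical, and the calculation ignores weekend and holiday rules and the advisory-action provision, so it is not a docketing tool.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Same day-of-month `months` later, clamped to the last day of that month."""
    total = d.month - 1 + months
    year = d.year + total // 12
    month = total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

mailing_date = date(2026, 2, 20)  # assumption: taken from the timeline entry
shortened_period_end = add_months(mailing_date, 3)
statutory_limit = add_months(mailing_date, 6)

print(shortened_period_end.isoformat())  # 2026-05-20
print(statutory_limit.isoformat())       # 2026-08-20
```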

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12640144: Generating Synthetic Conference Transcripts Using Natural Language Processing (2y 4m to grant; granted May 26, 2026)
Patent 12633286: MACHINE LEARNING MODEL IMPROVEMENT (2y 4m to grant; granted May 19, 2026)
Patent 12626697: SYSTEM AND METHOD FOR KEYWORD FALSE ALARM REDUCTION (2y 10m to grant; granted May 12, 2026)
Patent 12620262: USING ARTIFICIAL ENTITIES FOR GENERATING PERSONALIZED RESPONSES (11m to grant; granted May 05, 2026)
Patent 12592240: ENCODING AND DECODING OF ACOUSTIC ENVIRONMENT (2y 4m to grant; granted Mar 31, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 76%
With Interview: 91% (+15.2%)
Median Time to Grant: 3y 1m (~1y 0m remaining)
PTA Risk: Moderate
Based on 999 resolved cases by this examiner. Grant probability derived from career allowance rate.
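The headline figures can be reproduced from the counts on this page. The sketch below assumes the interview lift is added as percentage points to the career allowance rate, which is consistent with the 76% and 91% shown but is not stated by the page; the variable names are illustrative.

```python
granted = 758                 # from "758 granted / 999 resolved"
resolved = 999
interview_lift_points = 15.2  # assumption: lift is additive, in percentage points

allowance_rate = 100 * granted / resolved
with_interview = allowance_rate + interview_lift_points

print(f"Career allowance rate: {allowance_rate:.1f}% (rounds to {round(allowance_rate)}%)")
print(f"With interview: {with_interview:.1f}% (rounds to {round(with_interview)}%)")
```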
