Last updated: May 29, 2026

Application No. 18/634,519

ARTIFICIAL INTELLIGENCE DEVICE FOR ROBUST MULTIMODAL ENCODER FOR PERSON REPRESENTATIONS AND CONTROL METHOD THEREOF

Non-Final OA §103

Filed

Apr 12, 2024

Priority

Apr 13, 2023 — provisional 63/459,257

Examiner

COLUCCI, MICHAEL C

Art Unit

2655

Tech Center

2600 — Communications

Assignee

LG Electronics Inc.

OA Round

2 (Non-Final)

Interview Optional

Examiner has a relatively high allowance rate (76%); +15.2% interview lift. A written response may suffice.

Based on 999 resolved cases, 2023–2026

Examiner Intelligence

COLUCCI, MICHAEL C

Grants 76% — above average

Career Allowance Rate

758 granted / 999 resolved

+13.9% vs TC avg

Strong +15% interview lift

+15.2%

Interview Lift

resolved cases with interview

Typical timeline

3y 1m

Avg Prosecution

32 currently pending

Career history

1033

Total Applications

across all art units

Statute-Specific Performance

§101

3.6%

-36.4% vs TC avg

§103

86.9%

+46.9% vs TC avg

§102

2.9%

-37.1% vs TC avg

§112

1.1%

-38.9% vs TC avg

Based on career data from 999 resolved cases

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Response to Arguments
Applicant's arguments filed 12/23/2025 have been fully considered but they are not persuasive. On page 10 of 12 of the arguments, Applicant argues that the amended limitations from cancelled claims 3 and 4 are not taught by the combination and that the combination lacks a motivation to combine.
Examiner does not concur, as the concept of summed losses for user authentication is present in both Qian and SARAF, where Qian teaches audio-visual embedding layers with loss summation and SARAF teaches audio embedding layers with loss summation with respect to age. 
The concept is analogous to the present invention's fig. 7. Qian teaches fig. 3c, known as multi-modal fusion, which explicitly comprises an end-to-end concept with fusion and multi-modal audio-visual embedding as in Section V. on page 1084, and further multi-task learning, which explicitly utilizes multiple losses, for instance in an overall view of losses applied to fig. 3c. In other words, under BRI the claimed GE2E-MM loss is simply an end-to-end fusion multi-modal network with multi-loss calculations. See Section V., sub-sections A. and B.
While Qian teaches the audio-visual embedding component, SARAF teaches the audio component of embedded data as part of a multi-modal system in which audio can be part of a multi-media input file, as also suggested by SARAF 0031. Further, as in 0009 with 0023 and fig. 4c, there are clearly embeddings related to loss due to age as a totaled sum for increased accuracy of authentication.
SARAF teaches that a loss component is added among other loss components with respect to age, as in 0009 with 0023 and fig. 4c. Here the goal is to sum losses to produce more accurate training using multi-media embedding, e.g. voice of a user, suggested as part of audio-visual per se as in 0031.
Therefore, the motivation for such a combination amounts to using the existing audio-visual authentication with loss, meshed with the audio authentication with age loss of SARAF, to differentiate between a live, aging human and a static photo or synthetic deepfake, as the system can detect whether the age characteristics match a previous profile or expected age range. By adding a second constraint (age) to the biometric match (identity), the likelihood of a successful fraudulent login is reduced, as a hacker would need to match both the voice/face AND the estimated age.
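As an illustration of the summed-loss concept discussed above, the following is a minimal Python sketch, not the method of Qian, SARAF, or the application. It assumes the identity (GE2E-MM-style) loss has already been computed as a scalar, uses mean squared error as a stand-in age loss, and introduces a hypothetical `age_weight` parameter; none of these specifics come from the cited references.

```python
def age_loss(predicted_ages, true_ages):
    """Mean squared error over age predictions (an illustrative choice only)."""
    if not predicted_ages or len(predicted_ages) != len(true_ages):
        raise ValueError("age lists must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted_ages, true_ages)]
    return sum(squared) / len(squared)


def combined_multitask_loss(identity_loss, aux_age_loss, age_weight=1.0):
    """Add the auxiliary age loss to the identity loss.

    identity_loss: scalar assumed to be computed elsewhere (hypothetical input).
    age_weight: hypothetical scaling factor; 1.0 gives a plain sum.
    """
    return identity_loss + age_weight * aux_age_loss


# Example: one identity loss value plus an age loss over two samples.
aux = age_loss([30.0, 40.0], [32.0, 40.0])
total = combined_multitask_loss(0.5, aux)
```

With `age_weight=1.0` the result is a plain sum of the two terms, which is the simplest reading of one loss being "added to" another.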


Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 6, 7, 9-12, and 16, 17, 19, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Qian et al., Audio-Visual Deep Neural Network for Robust Person Verification, Feb 8th 2021, IEEE Vol 29, 1079-1092 (hereinafter Qian) in view of US 20230107741 A1 SARAF; Amruta et al. (hereinafter SARAF).
Re claim 1, Qian teaches 
1. A method for controlling an artificial intelligence (AI) device, the method comprising: (Abstract e.g. AV DNN)
obtaining, via a processor in the AI device, a video sample of a user and an audio sample of the user; (Audio + video as in 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
generating, via a neural network, a visual embedding based on the video sample and an audio embedding based on the audio sample, the visual embedding and the audio embedding being multi-dimensional vectors; (Audio + video embeddings as in 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
generating, via the neural network, an audio-visual embedding based on a combination of the visual embedding and the audio embedding; (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
determining, via the processor, a specific pre-enrolled audio-visual embedding from among a plurality of pre-enrolled audio-visual embeddings corresponding pre-enrolled users based on a distance away from the audio-visual embedding within a joint audio-visual subspace; and (Any user face/audio data known beforehand is analogous to pre-enrolled per se or pre-known, utilizing distances for loss on pair values 1085 sec. A…using audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
…wherein the neural network is trained based on a loss function that uses a plurality of audio-visual embeddings, each of the plurality of audio-visual embeddings includes an audio component and a visual component. (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
and wherein the loss function is based on a combined multi-task loss that includes a generalized end-to-end multi-modal (GE2E-MM) loss based on the audio component and the visual component… (analogous to the present invention's fig. 7, Qian teaches fig. 3c known as multi-modal fusion which explicitly comprises an end-to-end concept with fusion and multi-modal audio-visual embedding as in Section V. on page 1084, and further with multi-task learning which explicitly utilizes multiple losses for instance in an overall view of losses applied to fig. 3c, in other words, under BRI the claimed GE2E-MM loss is simply an end-to-end fusion multi-modal network with multi-loss calculations See Section V. 1085 sub-sections A. and B. …supplemented with e2e multi-modal 1080 col 1 with Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

However, while Qian teaches audio-visual embedding, pre-existing user data and person verification/authentication for A/V matching, it does not necessarily fully teach authentication in the context of having a user profile or account per se for enrollment and verification purposes, thus failing to teach at least in part:
verifying, via the processor, the user as the specific pre-enrolled user corresponding to the specific pre-enrolled audio-visual embedding (while Qian teaches the audio-visual embedding component, SARAF teaches the audio component of embedded data as part of multi-modal system in which audio can be part of a multi-media input file and also suggested by SARAF 0031, and further as in 0009 with 0023 and fig. 4c there is clearly embeddings related to loss due to age as a totaled sum for increased accuracy of authentication, already enrolled users are verified with various user data such as age, etc. utilizing loss 0007-0009, where a user has a personalized or custom profile/account which is compared to input data based on a scoring distance 0080-0081, in various modalities such as e2e with loss using aforementioned age and other parameters 0084 with fig. 4a-4c)
…added to an auxiliary task loss corresponding to an age component (SARAF teaches a loss component is added among other loss components with respect to age as in 0009 with 0023 and fig. 4c, here the goal is to sum losses to produce more accurate training using multi-media embedding e.g. voice of a user suggested as part of audio-visual per se as in 0031)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Qian to incorporate the above claim limitations as taught by SARAF to allow for combining prior art elements such as verification of users with another loss element using audio + visual using known user information prior or in a historical time-line according to known methods such as using enrolled users with a profile to yield predictable results, thereby using the existing audio-visual authentication with loss meshed with the audio authentication with age loss of SARAF to differentiate between a live, aging human and a static photo or synthetic deepfake, as the system can detect if the age characteristics match a previous profile or expected age range, wherein by adding a second constraint (age) to the biometric match (identity), the likelihood of a successful fraudulent login is reduced, as a hacker would need to match both the voice/face AND the estimated age, and utilizing explicit user profile or accounts with customizable data stored per user rather than in a cache per se, using biometrics and demographic information within a signal frequency/time as well as comparison thereof, with a user account.
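The distance-based determining and verifying steps mapped above can be sketched as follows. This is a minimal illustration under stated assumptions: cosine distance, a hypothetical acceptance `threshold`, and a plain dictionary of enrolled embeddings are choices made here, not details taken from the claims, Qian, or SARAF.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def verify_user(probe_embedding, enrolled_embeddings, threshold=0.3):
    """Find the closest enrolled embedding and accept it only if close enough.

    enrolled_embeddings: dict mapping a user id to that user's stored
    audio-visual embedding (hypothetical storage layout).
    threshold: hypothetical maximum distance for a positive verification.
    Returns (user_id or None, best_distance).
    """
    best_user, best_distance = None, float("inf")
    for user_id, embedding in enrolled_embeddings.items():
        distance = cosine_distance(probe_embedding, embedding)
        if distance < best_distance:
            best_user, best_distance = user_id, distance
    if best_distance <= threshold:
        return best_user, best_distance
    return None, best_distance


enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
matched_user, matched_distance = verify_user([0.9, 0.1], enrolled)
```

The first function covers the "distance away from the audio-visual embedding" language; the threshold check is one simple way to turn the nearest match into an accept or reject decision.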


Re claim 11, this claim has been rejected for teaching a broader or narrower claim based on general inclusion of hardware alone (e.g. processor, memory, instructions), representation of claim 1 omitting/including hardware for instance, otherwise amounting to a virtually identical scope.


Re claims 2 and 12, while Qian teaches pre-existing user data and person verification/authentication for A/V matching, it does not necessarily fully teach authentication in the context of having a user profile or account per se, linked to additional biometric or demographic data or user-specific custom features, for enrollment and verification purposes, thus failing to teach at least in part:
2. The method of claim 1, further comprising: outputting personalized content for the user based on the verifying the user as the specific pre-enrolled user. (SARAF already enrolled users are verified with various user data such as age, etc. utilizing loss 0007-0009, where a user has a personalized or custom profile/account which is compared to input data based on a scoring distance 0080-0081, in various modalities such as e2e with loss using aforementioned age and other parameters 0084 with fig. 4a-4c)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Qian to incorporate the above claim limitations as taught by SARAF to allow for combining prior art elements such as verification of users using known information prior or in a historical time-line according to known methods such as using enrolled users with a profile to yield predictable results, thereby utilizing explicit user profile or accounts with customizable data stored per user rather than in a cache per se, using biometrics and demographic information within a signal frequency/time as well as comparison thereof, with a user account.


Re claims 6 and 16, 
6. The method of claim 1, wherein the video sample includes a face track cropping of the user including a plurality of frames, and wherein the audio sample includes a recording of a voice of the user. (input of a user voice or recording at a point in time in frames e.g. a video to form image frames and segmented audio in general during analysis, Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)


Re claims 7 and 17, Qian teaches
7. The method of claim 1, further comprising: training the neural network based on batching N x M audio and visual inputs to update weights of the neural network, where N and M correspond to unique speakers and unique audio-visual utterances for each of the unique speakers, respectively. (supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1… Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth)
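The N x M batching recited in claim 7 can be sketched as below. This is a minimal illustration, assuming a hypothetical in-memory dictionary that maps each speaker to a list of utterance identifiers; the sampling strategy and the names `sample_batch`, `n_speakers`, and `m_utterances` are assumptions made here, not details from the application or Qian.

```python
import random


def sample_batch(utterances_by_speaker, n_speakers, m_utterances, seed=None):
    """Pick N distinct speakers and M distinct utterances for each of them.

    Returns a list of (speaker_id, [utterance, ...]) pairs, i.e. an N x M batch.
    """
    rng = random.Random(seed)
    # Only speakers with at least M utterances can fill a full row.
    eligible = [
        speaker
        for speaker, utterances in utterances_by_speaker.items()
        if len(utterances) >= m_utterances
    ]
    if len(eligible) < n_speakers:
        raise ValueError("not enough speakers with M utterances")
    chosen = rng.sample(eligible, n_speakers)
    return [
        (speaker, rng.sample(utterances_by_speaker[speaker], m_utterances))
        for speaker in chosen
    ]


dataset = {
    "spk1": ["a1", "a2", "a3"],
    "spk2": ["b1", "b2", "b3"],
    "spk3": ["c1", "c2", "c3"],
}
batch = sample_batch(dataset, n_speakers=2, m_utterances=2, seed=0)
```

Each row of the batch holds M utterances from one speaker, so a loss computed over the batch can compare utterances of the same speaker against those of the other N - 1 speakers.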


Re claims 9 and 19, Qian teaches
9. The method of claim 1, further comprising: 
transforming, via the neural network, the audio embedding into a projected audio embedding projected onto the joint audio-visual subspace; (as in fig. 3 c. transform layers…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)
transforming, via the neural network, the visual embedding into a projected visual embedding projected onto the joint audio-visual subspace; (as in fig. 3 c. transform layers…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)
multiplying, via the neural network, the projected audio embedding by a voice attention weight to generate a weighted audio embedding; (as in fig. 3 c. attention layer weight…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)
multiplying, via the neural network, the projected visual embedding by a face attention weight to generate a weighted visual embedding; and (as in fig. 3 c. attention layer weight…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)
summing, via the neural network, the weighted audio embedding and the weighted visual embedding to generate the audio-visual embedding corresponding to the user. (as in fig. 3 c. summation or combination to produce A/V embedding…Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth…)


Re claim 10, Qian teaches
10. The method of claim 1, wherein the voice attention weight added to face attention weight equal 1. (with a single attention weight at varying values from 0 to xth supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1… Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth)
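The projection, weighting, and summation steps of claims 9 and 10 can be sketched as follows. This is a minimal illustration under stated assumptions: the projections are plain matrix-vector products, and the two attention weights come from a two-way softmax over hypothetical scalar scores so that they sum to 1. Neither the claims nor the cited passages of Qian specify these particular choices.

```python
import math


def project(matrix, vector):
    """Matrix-vector product: map an embedding into the joint subspace."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_weights(audio_score, visual_score):
    """Two-way softmax; the returned weights are positive and sum to 1."""
    top = max(audio_score, visual_score)
    exp_audio = math.exp(audio_score - top)
    exp_visual = math.exp(visual_score - top)
    total = exp_audio + exp_visual
    return exp_audio / total, exp_visual / total


def fuse(audio_emb, visual_emb, w_audio, w_visual, audio_score, visual_score):
    """Project both embeddings, scale each by its weight, and sum them."""
    projected_audio = project(w_audio, audio_emb)
    projected_visual = project(w_visual, visual_emb)
    voice_weight, face_weight = attention_weights(audio_score, visual_score)
    fused = [
        voice_weight * a + face_weight * v
        for a, v in zip(projected_audio, projected_visual)
    ]
    return fused, (voice_weight, face_weight)


identity = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse([1.0, 0.0], [0.0, 1.0], identity, identity, 0.0, 0.0)
```

With equal scores the two weights are each 0.5, so the fused vector is the average of the two projected embeddings; raising one score shifts the fused embedding toward that modality.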


Re claim 20, Qian teaches
20. A method for controlling an artificial intelligence (AI) device, the method comprising: (fig. 6)
obtaining, via a processor in the AI device, a video sample of a user and an audio sample of the user; (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
generating, via a neural network, a visual embedding based on the video sample and an audio embedding based on the audio sample, the visual embedding and the audio embedding being multi-dimensional vectors; and (image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1, Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1.)
generating, via the neural network, an audio-visual embedding based on a combination of the visual embedding and the audio embedding, the audio-visual embedding being a biometric of the user, (Abstract biometrics, and Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
wherein the neural network is trained based on a loss function that uses a plurality of audio-visual embeddings, each of the plurality of audio-visual embeddings includes an audio component and a visual component. (Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)
and wherein the loss function is based on a combined multi-task loss that includes a generalized end-to-end multi-modal (GE2E-MM) loss based on the audio component and the visual component… (analogous to the present invention's fig. 7, Qian teaches fig. 3c known as multi-modal fusion which explicitly comprises an end-to-end concept with fusion and multi-modal audio-visual embedding as in Section V. on page 1084, and further with multi-task learning which explicitly utilizes multiple losses for instance in an overall view of losses applied to fig. 3c, in other words, under BRI the claimed GE2E-MM loss is simply an end-to-end fusion multi-modal network with multi-loss calculations See Section V. 1085 sub-sections A. and B. …supplemented with e2e multi-modal 1080 col 1 with Audio + video embeddings combined as in fig. 6 with losses and also complex attention weights for A-V output embedding as in fig. 3 C. and 1082 Sec. B2 with a single attention weight at varying values from 0 to xth… supported with 1087 section B. 1. image segments from a video and fig. 6 and audio + visual data as part of a neural network to become embedded output vectors 1084 sec. v. sec. A. exemplified with dimensionality as NxM weighted vector and/or matrices 1082 Sec. III with Table 1)

However, while Qian teaches audio-visual embedding, pre-existing user data and person verification/authentication for A/V matching, it does not necessarily fully teach authentication in the context of having a user profile or account per se for enrollment and verification purposes, thus failing to teach at least in part:
…added to an auxiliary task loss corresponding to an age component (SARAF teaches a loss component is added among other loss components with respect to age as in 0009 with 0023 and fig. 4c, here the goal is to sum losses to produce more accurate training using multi-media embedding e.g. voice of a user suggested as part of audio-visual per se as in 0031)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Qian to incorporate the above claim limitations as taught by SARAF to allow for combining prior art elements such as verification of users using known information prior or in a historical time-line according to known methods such as using enrolled users with a profile to yield predictable results, thereby utilizing explicit user profile or accounts with customizable data stored per user rather than in a cache per se, using biometrics and demographic information within a signal frequency/time as well as comparison thereof, with a user account.


Allowable Subject Matter
Claims 5 and 15, as well as 8 and 18, are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.


Conclusion

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

US 11238843 B2 Arik; Sercan O. et al.
User profiles for A/V

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL COLUCCI whose telephone number is (571)270-1847.  The examiner can normally be reached on M-F 9 AM - 7 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571)-270-1847
Examiner FAX:  (571)-270-2847
Michael.Colucci@uspto.gov

Prosecution Timeline

Apr 12, 2024

Application Filed

Nov 05, 2025

Non-Final Rejection mailed — §103

Dec 23, 2025

Response Filed

Feb 20, 2026

Final Rejection mailed — §103

Mar 26, 2026

Response after Non-Final Action

May 20, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/422,681

Patent 12640144

Generating Synthetic Conference Transcripts Using Natural Language Processing

2y 4m to grant Granted May 26, 2026

18/401,171

Patent 12633286

MACHINE LEARNING MODEL IMPROVEMENT

2y 4m to grant Granted May 19, 2026

18/352,601

Patent 12626697

SYSTEM AND METHOD FOR KEYWORD FALSE ALARM REDUCTION

2y 10m to grant Granted May 12, 2026

19/225,487

Patent 12620262

USING ARTIFICIAL ENTITIES FOR GENERATING PERSONALIZED RESPONSES

11m to grant Granted May 05, 2026

18/515,502

Patent 12592240

ENCODING AND DECODING OF ACOUSTIC ENVIRONMENT

2y 4m to grant Granted Mar 31, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

2-3

Expected OA Rounds

76%

Grant Probability

91%

With Interview (+15.2%)

3y 1m (~1y 0m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 999 resolved cases by this examiner. Grant probability derived from career allowance rate.
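The projection figures can be roughly reproduced from the numbers shown on this page. The sketch below is an assumption about the arithmetic, not a documented formula: it treats the grant probability as granted cases divided by resolved cases, and treats the interview lift as additive percentage points. The function names are hypothetical.

```python
def allowance_rate_pct(granted, resolved):
    """Career allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


def with_interview_pct(base_pct, lift_points):
    """Add the interview lift in percentage points, capped at 100 (assumed additive)."""
    return min(100.0, base_pct + lift_points)


base = round(allowance_rate_pct(758, 999))    # 758 granted / 999 resolved
boosted = round(with_interview_pct(base, 15.2))
```

Under these assumptions the rounded values match the 76% and 91% shown in the projections.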