DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/04/2026 has been entered.
Examiner Notes
(1) In the case of amending the claimed invention, Applicant is respectfully requested to indicate the portion(s) of the specification which dictate(s) the structure relied on for proper interpretation and also to verify and ascertain the metes and bounds of the claimed invention. This will assist in expediting compact prosecution. MPEP 714.02 recites: “Applicant should also specifically point out the support for any amendments made to the disclosure. See MPEP § 2163.06. An amendment which does not comply with the provisions of 37 CFR 1.121 (b), (c), (d), and (h) may be held not fully responsive. See MPEP § 714.” Amendments not pointing to specific support in the disclosure may be deemed as not complying with the provisions of 37 CFR 1.121 (b), (c), (d), and (h) and therefore held not fully responsive. Generic statements such as "Applicants believe no new matter has been introduced" may be deemed insufficient.
(2) Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
Response to Arguments
Applicant’s amendments to the Abstract, Specification and claims have overcome each objection and each 35 U.S.C. 101 rejection previously set forth in the Non-Final Office Action mailed 11/01/2013.
Applicant’s arguments with respect to claims 1 and 19 have been considered but are moot in view of the new ground(s) of rejection (see the newly cited reference, Petropoulos).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 5-11, 15, 19, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), further in view of Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1).
Regarding claim 1, Sundaram teaches: a system for optimizing queries for a data lake storing data ingested from one or more data lake sources, the data comprising a plurality of data object of different native format ingested from the one or more data lake sources (Fig. 2, paragraph [0029], writing one or more records to a data lake may involve receiving data from one or more data sources and output received data to a data ingestion and/or mutation pipeline; also see paragraph [0036], such pipelines may be used to maintain read and write streams of data, like a message service, for consumption by one or more downstream jobs spawned by the data storage engine; also see paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; noted, data sources 270 through 272 are interpreted as “one or more data lake sources”; also see paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0054] and [0056]; further noted, ‘for’ indicates intended use; Minton v. Nat’l Ass’n of Securities Dealers, Inc., 336 F.3d 1373, 1381, 67 USPQ2d 1614, 1620 (Fed. Cir. 2003) “whereby clause in a method claim is not given weight when it simply expresses the intended result of a process step positively recited.” Examples of claim language, although not exhaustive, that may raise a question as to the limiting effect of the language in a claim are: (A) “adapted to” or “adapted for” clauses; (B) “wherein” clauses; and (C) “whereby” clauses. Therefore intended use limitations are not required to be taught, see MPEP 2111.04 [R-3]), the system comprising:
memory separate from the one or more data lake sources (Fig. 2, paragraph [0029], writing one or more records to a data lake may involve receiving data from one or more data sources and output received data to a data ingestion and/or mutation pipeline; also see paragraph [0036], such pipelines may be used to maintain read and write streams of data, like a message service, for consumption by one or more downstream jobs spawned by the data storage engine; noted, “data ingestion and/or mutation pipeline” is interpreted as “memory separate from the one or more data lake sources” ), the memory configured to store:
a plurality of data partitions storing data from the plurality of data objects ingested from the one or more data lake sources of the data lake, the plurality of data partitions partitioned based on a key (paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…; also see paragraph [0038], data storage engine may store the keys of partitions that were updated in a changed log in the change log repository).
Sundaram does not explicitly disclose: a partition index, […], comprising entries storing values of the key, the value of the key each mapped to one or more of the plurality of data partitions.
Khurana teaches: a partition index, […], comprising entries storing values of the key, the value of the key each mapped to one or more of the plurality of data partitions (Fig. 1, col. 4, line 11-28, the records (112A, 112N) may be distributed over the partitions (120A, 120N) of the database based on partition information for the table; the partition information may include one or more fields (e.g., column), referred to as a partition field; distinct values of a partition field may be referred to as partition identifiers, where each distinct value identifies (e.g., corresponds to) a specific partition; for example, the partition information may include a single field “geographic region,” where different partitions correspond to different regions; continuing this example, the partition identifiers for the partition field “geographic region” may be either India, US, or Canada; noted, partition information is interpreted as “partition index”; partition identifiers such as India, US, or Canada, which are mapped to the specific partitions “India”, “US”, or “Canada” in association with the partition field “geographic region”, are interpreted as values of the key each mapped to one or more of the plurality of data partitions).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a partition index, […], comprising entries storing values of the key, the value of the key each mapped to one or more of the plurality of data partitions into the data lake system of Sundaram.
The motivation to do so would be that including a partition index, […], comprising entries storing values of the key, the value of the key each mapped to one or more of the plurality of data partitions may reduce the processing burden on the database (Khurana, col. 3, line 22).
Sundaram and Khurana further teach:
said partition index, separate from the one or more data lake sources and the plurality of data partitions (Sundaram, Fig. 2, paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…; also see paragraph [0038], data storage engine may store the keys of partitions that were updated in a changed log in the change log repository; Fig. 2 illustrates that the change log repository 212, in which the keys of partitions are stored, is separate from the data partitions 220 and the data sources [270…272]; therefore, it reads on the limitation as claimed).
Sundaram as modified by Khurana does not explicitly disclose:
a processor, coupled to the memory, the processor configured to execute a query engine for processing queries for the data lake, the query engine configured to: receive, through a communication network, from a client device, a query on target data from the plurality of data objects ingested from the one or more data lake sources;
execute the query on the target data, the execution comprising: identify, using the partition index separate from the one or more data lake sources and the plurality of data partitions, at least one data partition of the plurality of data in which the target data was stored after ingestion from the one or more data lake sources of data lake.
Petropoulos teaches: a processor, coupled to the memory, the processor configured to execute a query engine for processing queries for the data lake, the query engine configured to: receive, through a communication network, from a client device, a query on target data from the plurality of data objects ingested from the one or more data lake sources (Fig. 5, paragraph [0030], one data storage may be an object-based data store that allows for different data objects of different formats or types of data to be stored and managed according to a key value or other unique identifier that identifies the object; data storage service(s) may be treated as data lake; also see paragraph [0034], to execute various queries for data already stored in the data processing service or data stored in a data lake hosted in data storage service(s); also see paragraph [0060], leader node may be a server that receives a query from various client programs (e.g., applications) and/or subscribers (users));
execute the query on the target data, the execution comprising: identify, using the partition index separate from the one or more data lake sources and the plurality of data partitions, at least one data partition of the plurality of data in which the target data was stored after ingestion from the one or more data lake sources of data lake (paragraph [0030], one data storage may be an object-based data store that allows for different data objects of different formats or types of data to be stored and managed according to a key value or other unique identifier that identifies the object; data storage service(s) may be treated as data lake; also see paragraph [0034], to execute various queries for data already stored in the data processing service or data stored in a data lake hosted in data storage service(s); also see Fig. 5, paragraph [0057], ingestion processing may implement partition assignment to determine which partition of not-structured data set to store the data objects; the data objects may be stored according to the partition assignment in not-structured data set in the data storage service; each partition may be stored as a collection of objects, such as buckets for partition 540a, 540b and 540n storing data object(s) 542a, 542b, and 542n respectively; also see paragraph [0063], sending request operations to non-structure data processing service; operation requests may be self-describing, identifying the particular portion of data to which the operation is to be applied; requested operation(s) may include a partition identifier for the not-structured distributed data set specifying the partition upon which the operation is to be performed; in combination with said partition index, separate from the one or more data lake sources and the plurality of data partitions taught by Sundaram as shown above, it reads on the limitation as claimed).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include receiving, through a communication network, from a client device, a query on target data from the plurality of data objects ingested from the one or more data lake sources; executing the query on the target data, the execution comprising: identify, using the partition index separate from the one or more data lake sources and the plurality of data partitions, at least one data partition of the plurality of data in which the target data was stored after ingestion from the one or more data lake sources of data lake into the data lake system of Sundaram.
The motivation to do so would be to include receiving, through a communication network, from a client device, a query on target data from the plurality of data objects ingested from the one or more data lake sources; executing the query on the target data, the execution comprising: identify, using the partition index separate from the one or more data lake sources and the plurality of data partitions, at least one data partition of the plurality of data in which the target data was stored after ingestion from the one or more data lake sources of data lake, such that clients of a common query engine can focus on developing applications or use cases for data without regard to the underlying format of the data (Petropoulos, paragraph [0018], line 14-16).
Sundaram as modified by Khurana and Petropoulos further teach:
identify, based on the query, at least one of the entries of the partition index storing at least one key value mapped to at least one respective data partition of the plurality of data partitions (Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field); noted, “partition identifier for the partition field” is interpreted as “key value mapped to at least one respective data partition of the plurality of data partitions”; while Petropoulos, paragraph [0063], sending request operations to non-structure data processing service; operation requests may be self-describing, identifying the particular portion of data to which the operation is to be applied; requested operation(s) may include a partition identifier for the not-structured distributed data set specifying the partition upon which the operation is to be performed; also see Petropoulos, paragraph [0064], accessing metadata stored along with data object in the data storage service to check whether the operation will return any result out the data object; for example, range values for timestamp may be used as index to check whether any data objects in a partition have a timestamp value within the range specified by the operation);
and identify, as the at least one data partition in which the target data was stored, the at least one respective data partition mapped to the at least one key value (Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field); noted, “partition identifier for the partition field” is interpreted as “key value mapped to at least one respective data partition of the plurality of data partitions”; while Petropoulos, paragraph [0063], sending request operations to non-structure data processing service; operation requests may be self-describing, identifying the particular portion of data to which the operation is to be applied; requested operation(s) may include a partition identifier for the not-structured distributed data set specifying the partition upon which the operation is to be performed; also see paragraph [0064], accessing metadata stored along with data object in the data storage service to check whether the operation will return any result out the data object; for example, range values for timestamp may be used as index to check whether any data objects in a partition have a timestamp value within the range specified by the operation);
execute the query on the identified at least one data partition to obtain response data at least in part by reading the response data from the memory of the system (Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field); also see col. 6, line 35-38, a subset of data lake records that include the partition identifier is extracted from the data lake records; while Petropoulos, paragraph [0064], accessing metadata stored along with data object in the data storage service to check whether the operation will return any result out the data object; for example, range values for timestamp may be used as index to check whether any data objects in a partition have a timestamp value within the range specified by the operation; also see paragraph [0065], retrieving data value requested in operations or apply transformation rules to change retrieved data into format understandable by compute node prior to sending the transformed data back as results);
transmit, through the communication network, to the client device, the response data (Petropoulos, paragraph [0065], retrieving data value requested in operations or apply transformation rules to change retrieved data into format understandable by compute node prior to sending the transformed data back as results; also see paragraph [0066], remote data processing client may read, process, or otherwise obtain results from processing nodes).
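For illustration only, the arrangement discussed in the rejection of claim 1 (a partition index, kept separate from the data partitions, whose entries map key values to partitions and which is consulted before a query is executed) can be sketched as follows. This sketch is not drawn from Sundaram, Khurana, Petropoulos, or the instant specification; the names PartitionIndex, ingest, and run_query, the "region" field, and the partition naming scheme are hypothetical.

```python
from collections import defaultdict


class PartitionIndex:
    """Hypothetical index held apart from the partitions: key value -> partition ids."""

    def __init__(self):
        self._entries = defaultdict(set)

    def add(self, key_value, partition_id):
        self._entries[key_value].add(partition_id)

    def lookup(self, key_value):
        # Return a copy so callers cannot mutate the index entries.
        return set(self._entries.get(key_value, set()))


def ingest(records, key, partitions, index):
    """Store each record in the partition for its key value and record the mapping."""
    for rec in records:
        partition_id = f"{key}={rec[key]}"  # assumed naming scheme, illustration only
        partitions.setdefault(partition_id, []).append(rec)
        index.add(rec[key], partition_id)


def run_query(key_value, predicate, partitions, index):
    """Consult the index first, then scan only the partitions it identifies."""
    hits = []
    for partition_id in sorted(index.lookup(key_value)):
        hits.extend(r for r in partitions[partition_id] if predicate(r))
    return hits


partitions = {}
index = PartitionIndex()
ingest(
    [{"region": "US", "v": 1}, {"region": "India", "v": 2}, {"region": "US", "v": 3}],
    "region",
    partitions,
    index,
)
result = run_query("US", lambda r: r["v"] > 1, partitions, index)
```

In this sketch a query for a key value that has no index entry touches no partition at all, which is the pruning effect the index is meant to provide.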
Regarding claim 5, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is configured to identify, based on the query, the at least one entry of the partition index by performing: identify the at least one key value in the query (Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field)); and search for the at least one key value in the entries of the partition index (Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field); also see col. 6, line 35-38, a subset of data lake records that include the partition identifier is extracted from the data lake records).
Regarding claim 6, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is configured to: receive, through the communication network from the one or more data lake sources, one or more data objects to be stored in the data lake (Sundaram, Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214); and store data from the one or more data objects in at least one data partition of the plurality of data partitions (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…).
Regarding claim 7, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 6, further teach: wherein the processor is configured to: sort the data from the one or more data objects into the at least one data partition using one or more values of the key in the data (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…).
Regarding claim 8, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is configured to partition the data from the plurality of data objects into the plurality of data partitions (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…).
Regarding claim 9, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 8, further teach wherein the processor is configured to partition the data into the plurality of data partitions by performing: determine the key; and partition the data from the data objects originating from the one or more data lake sources based on the key (Sundaram, paragraph [0039], the data lake 214 may store the data received by the ingested service; also see paragraph [0033], the stored data 220 is separated into some number of partitions including the partition 1 222 through N 224; also see paragraph [0041], the data may be partitioned into buckets by {orgId, engagementDay}…; also see paragraph [0038], data storage engine may store the keys of partitions that were updated in a changed log in the change log repository).
Regarding claim 10, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 9, further teach wherein the processor is configured to determine the key by performing: receive user input indicating one or more fields of the data to be used as the key (Petropoulos, Fig. 5, paragraph [0030], one data storage may be an object-based data store that allows for different data objects of different formats or types of data to be stored and managed according to a key value or other unique identifier that identifies the object; data storage service(s) may be treated as data lake; also see paragraph [0034], to execute various queries for data already stored in the data processing service or data stored in a data lake hosted in data storage service(s); also see paragraph [0060], leader node may be a server that receives a query from various client programs (e.g., applications) and/or subscribers (users); also see paragraph [0064], accessing metadata stored along with data object in the data storage service to check whether the operation will return any result out the data object; for example, range values for timestamp may be used as index to check whether any data objects in a partition have a timestamp value within the range specified by the operation; in combination with the teaching of Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field), it reads on the limitation as claimed).
Regarding claim 11, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 9, further teach wherein the processor is configured to determine the key by performing: determine at least one field in the data expected to be used in queries received by the system; and use the at least one field as the key (Khurana, Fig. 2, Fig. 4B, col. 6, line 8-20, the partition-specific database query is executed to obtain database records stored in the table in a partition of the database during the time interval; for example, each partition-specific query may include a “where” clause that restricts the query results to records that include a specific partition identifier for the partition field; validating that each partition-specific database query refers to the partition information (e.g., a partition field); also see col. 6, line 35-38, a subset of data lake records that include the partition identifier is extracted from the data lake records).
Regarding claim 15, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the plurality of data partitions are configured to store at least some of the data in a columnar storage format (Sundaram, Fig. 2, paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; also see paragraph [0067], one upstream source may have data lake in Databricks Delta in parquet format on Amazon S3…another data source may have data stored with Azure blob storage in Avro file format…; noted, ‘parquet’ is interpreted as columnar storage format; it is noted that one of ordinary skill in the art would know that “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval”).
As per claim 19, this claim is rejected on grounds corresponding to the same rationales given above for rejected claim 1 and is similarly rejected.
Regarding claim 25, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, further teach wherein the processor is further configured to execute an ingestion engine, the ingestion engine configured to: ingest the data objects from the one or more data lake sources of the data lake (Sundaram, Fig. 2, paragraph [0029], writing one or more records to a data lake may involve receiving data from one or more data sources and output received data to a data ingestion and/or mutation pipeline; also see paragraph [0032]-[0034], the data pipeline system 200 includes one or more data sources 270 through 272, which provide data to a data ingestion engine service 202; a data storage engine reads the data from the pipelines and stores it in a data lake 214; also see paragraph [0067], one upstream source may have data lake in Databricks Delta in parquet format on Amazon S3…another data source may have data stored with Azure blob storage in Avro file format…; noted, data sources 270 through 272 are interpreted as one or more data lake sources).
Claims 2-4, 20 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), and Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1), further in view of Horowitz et al. (U.S. Pub. No. 2012/0254175 A1).
Regarding claim 2, Sundaram as modified by Khurana and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key.
Horowitz teaches: wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key (paragraph [0006], the partition component is further configured to define the first partition having a minimum key value and a maximum key value…; the database is organized into a plurality of collections includes a continuous range of data from the database, wherein the contiguous range comprises a range of one or more key values associated with the database data; also see paragraph [0007], the partition component is further configured to assign at least any data in the at least one of the plurality of database partitions having associated database key less than the maximum value to the first partition, and assign at least any data in the at least one of the plurality of database partition having database key values greater that the maximum value to the second partition; the partition component is further configured to identify database partitions having a sequential database key).
It would have been obvious to one of ordinary skill in art before the effective filing date of the claim invention to include wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key into data lake system of Sundaram.
Motivation to do so would be to include wherein the plurality of data partitions are stored in a plurality of shards associated with respective ranges of a shard key to minimize overhead associated with maintained sharded data (Horowitz, paragraph [0003], line 12-14).
Regarding claim 3, Sundaram as modified by Khurana, Petropoulos and Horowitz teach all claimed limitations as set forth in rejection of claim 2, further teach: wherein the key based on which the plurality of data partitions are partitioned is the shard key (Horowitz, paragraph [0006], the partition component is further configured to define the first partition having a minimum key value and a maximum key value…; the database is organized into a plurality of collections includes a continuous range of data from the database, wherein the contiguous range comprises a range of one or more key values associated with the database data; also see paragraph [0007], the partition component is further configured to assign at least any data in the at least one of the plurality of database partitions having associated database key less than the maximum value to the first partition, and assign at least any data in the at least one of the plurality of database partitions having database key values greater than the maximum value to the second partition; the partition component is further configured to identify database partitions having a sequential database key).
Regarding claim 4, Sundaram as modified by Khurana, Petropoulos and Horowitz teach all claimed limitations as set forth in rejection of claim 2, further teach: wherein the processor is configured to perform rebalancing of the plurality of data partitions among the plurality of shards (Horowitz, paragraph [0077]-[0078], rebalancing chunk distribution within a shard cluster….).
As per claim 20, this claim is rejected on grounds corresponding to the rationale given above for rejected claim 2 and is similarly rejected.
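For illustration only, the range-based assignment of data to shards discussed above for claims 2-3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Horowitz or of the claimed invention: the names UPPER_BOUNDS, SHARDS and shard_for_key are hypothetical, keys are assumed to be integers, and each shard is assumed to own keys strictly below its upper bound.

```python
import bisect

# Hypothetical, sorted exclusive upper bounds of each shard's key range.
# shard-0 owns keys below 100, shard-1 owns [100, 200), shard-2 owns [200, 300).
UPPER_BOUNDS = [100, 200, 300]
SHARDS = ["shard-0", "shard-1", "shard-2"]


def shard_for_key(key, bounds=UPPER_BOUNDS, shards=SHARDS):
    """Return the shard whose key range contains `key`."""
    # bisect_right counts the upper bounds that are <= key, which is the
    # position of the first range whose exclusive upper bound exceeds key.
    position = bisect.bisect_right(bounds, key)
    if position >= len(shards):
        raise KeyError(f"key {key} is outside every shard range")
    return shards[position]
```

Under these assumptions, a key equal to a boundary value falls in the higher shard, which mirrors the cited assignment of keys less than the maximum value to the first partition.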
Regarding claim 27, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose wherein the processor is further configured to perform rebalancing of the plurality of data partitions among the plurality of shards by performing: determine whether a shard of the plurality of shards has reached a threshold size; and divide the shard into multiple shards when it is determined that the shard has reached the threshold size.
Horowitz teaches: wherein the processor is further configured to perform rebalancing of the plurality of data partitions among the plurality of shards by performing: determine whether a shard of the plurality of shards has reached a threshold size (Fig. 8, paragraph [0097], when the data within the chunk exceeds a size threshold; also see paragraph [0047]); and divide the shard into multiple shards when it is determined that the shard has reached the threshold size (Fig. 8, paragraph [0097], when the data within the chunk exceeds a size threshold, the shard cluster is configured to split the chunk into two partitions).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the processor is further configured to perform rebalancing of the plurality of data partitions among the plurality of shards by performing: determine whether a shard of the plurality of shards has reached a threshold size; and divide the shard into multiple shards when it is determined that the shard has reached the threshold size into the data lake system of Sundaram.
Motivation to do so would be to include wherein the processor is further configured to perform rebalancing of the plurality of data partitions among the plurality of shards by performing: determine whether a shard of the plurality of shards has reached a threshold size; and divide the shard into multiple shards when it is determined that the shard has reached the threshold size to address the issue that, as the necessary processing capacity increases, the cost to further increase capacity can grow exponentially (Horowitz, paragraph [0001], lines 10-12).
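For illustration only, the threshold-based split discussed above for claim 27 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Horowitz or of the claimed invention: the Shard class and split_if_over_threshold function are hypothetical names, size is measured as a record count, and the split point is assumed to be the median key.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical shard covering the shard-key range [min_key, max_key)."""
    min_key: int
    max_key: int
    rows: dict = field(default_factory=dict)  # shard-key value -> record


def split_if_over_threshold(shard, threshold):
    """Return [shard] unchanged, or two shards split at the median key."""
    # A shard that has not reached the threshold (or holds fewer than two
    # records) is left as it is.
    if len(shard.rows) < threshold or len(shard.rows) < 2:
        return [shard]
    keys = sorted(shard.rows)
    mid = keys[len(keys) // 2]
    # Keys below the split point stay in the lower range; the rest move up.
    low = Shard(shard.min_key, mid,
                {k: v for k, v in shard.rows.items() if k < mid})
    high = Shard(mid, shard.max_key,
                 {k: v for k, v in shard.rows.items() if k >= mid})
    return [low, high]
```

The two resulting shards cover adjacent key ranges that together equal the original range, so a range lookup on the shard key continues to resolve to exactly one shard.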
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), and Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1), further in view of Dudami et al. (U.S. Patent No. 10,846,307 B1).
Regarding claim 12, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format.
Dudami teaches: wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format (col. 3, lines 53-64, the data lake is a data repository that stores large amounts of raw data in the native format of the raw data until that data is needed by other entities; also see col. 4, lines 24-40, obtains raw data from the electronic data source 102; by raw data, it is meant data in its native format that has not been transformed into a different format; when the first metadata element matches the raw data 106, the raw data 106 is streamed into a first data storage structure in the data lake 108; by ‘match’, it is meant whether the incoming raw data has all that is required (e.g., all the required fields, elements, or identifier) by metadata…; noted, raw data [native format data object] that ‘match’ are streamed to a specific data storage structure; thus it indicates that raw data [native format data object] with a shared native format [e.g., all the required fields, elements, or identifier] are stored together, which reads on wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format as claimed).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format into the data lake system of Sundaram.
Motivation to do so would be to include wherein the processor is configured to group, in the memory, data from at least some of the plurality of data objects that share a native format to overcome the difficulty of efficiently retrieving and performing operations on the data (Dudami, col. 1, lines 14-16).
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), and Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1), further in view of Tormasov et al. (U.S. Pub. No. 2020/0174893 A1).
Regarding claim 13, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file.
Tormasov teaches: wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file (paragraph [0045], in response to determining that the file size of the file is less than a file-size threshold,…may combine the file with other files into a data blob…).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file into the data lake system of Sundaram.
Motivation to do so would be to include wherein one or more of the plurality of data partitions comprise a plurality of files, and the processor is further configured to: determine that the files contain less than a threshold amount of data; and combine the plurality of files into a single file so that data is packed into blobs for efficient storage (Tormasov, paragraph [0002]).
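For illustration only, the combining of small files discussed above for claim 13 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Tormasov or of the claimed invention: pack_small_files is a hypothetical name, files are modeled as an in-memory mapping of names to bytes, and the threshold is compared against each file's length in bytes.

```python
def pack_small_files(files, threshold):
    """Pack every file smaller than `threshold` bytes into one blob.

    `files` maps a file name to its bytes. Returns the combined blob, an
    index of name -> (offset, length) into the blob, and the files that
    were large enough to be left alone.
    """
    blob = bytearray()
    index = {}
    untouched = {}
    for name, data in files.items():
        if len(data) < threshold:
            # Record where this file starts inside the blob, then append it.
            index[name] = (len(blob), len(data))
            blob.extend(data)
        else:
            untouched[name] = data
    return bytes(blob), index, untouched
```

The offset index is what allows an individual file to be read back out of the combined blob without scanning it.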
Claims 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), and Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1), further in view of Wilson et al. (U.S. Pub. No. 2020/0301941 A1).
Regarding claim 17, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index.
Wilson teaches: wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index (paragraph [0122], information about files in data lake, including user-specified information (e.g., bucket/directories, meta-data, etc.) and/or information included in the directory structure, files, and/or file names, can be used to build a partition specification…; the database system can use the partitions to improve a data lake query; for example, if the query includes a data aspect used to build the partitions (e.g., a data or time field of a virtual collection that is inherited from the file names), the database can leverage such information in the partition to go to the desired file).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index into the data lake system of Sundaram.
Motivation to do so would be to include wherein the processor is configured to: receive, from the communication network, from the client device, a query for metadata about a set of data, the metadata stored in the partition index to instantiate one or more computing resources to process the query using the virtual collection to generate the response to the query (Wilson, paragraph [0039], lines 10-12).
Sundaram as modified by Khurana, Petropoulos and Wilson further teach:
execute the query by reading the metadata from the partition index without accessing a data partition of the plurality of data partitions (Wilson, paragraph [0122], information about files in data lake, including user-specified information (e.g., bucket/directories, meta-data, etc.) and/or information included in the directory structure, files, and/or file names, can be used to build a partition specification…; the database system can use the partitions to improve a data lake query; for example, if the query includes a data aspect used to build the partitions (e.g., a data or time field of a virtual collection that is inherited from the file names), the database can leverage such information in the partition to go to the desired file; also see paragraph [0017], a partition associated with a portion of query using a range of a field; also see paragraph [0090], identifying partitions; a partition can be determined and/or specified based on a value of a field, a range of values of a field…; also see paragraph [0093], if a query selects or filters documents with an asOfDateTime field in this range, n, the partition specification provides information allowing agent servers to filter or prune data that is read such that the portion of the query can be executed using this particular partition file…);
and transmit, through the communication network, to the client device, the metadata (Wilson, paragraph [0122], information about files in data lake, including user-specified information (e.g., bucket/directories, meta-data, etc.) and/or information included in the directory structure, files, and/or file names, can be used to build a partition specification…; the database system can use the partitions to improve a data lake query; for example, if the query includes a data aspect used to build the partitions (e.g., a data or time field of a virtual collection that is inherited from the file names), the database can leverage such information in the partition to go to the desired file; also see paragraph [0124], the query service nodes can divide and conquer a query by processing data from the files within the customer S3 buckets; the results of the query service nodes can be merged to create a result set; the result set can be returned to the user).
Regarding claim 18, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein: at least some of the data objects are stored in respective virtual collections.
Wilson teaches: wherein: at least some of the data objects are stored in respective virtual collections (paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0055], the data lake can be implemented as a database with logical organizations of subsets of database data in virtual collections; also see paragraph [0059], the system builds virtual collections within the object data storage…; also see paragraph [0082], the storage configuration file can provide a flexible mapping of data, including combining individual objects, files and/or collections into virtual collections).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein: at least some of the data objects are stored in respective virtual collections into the data lake system of Sundaram.
Motivation to do so would be to include wherein: at least some of the data objects are stored in respective virtual collections to instantiate one or more computing resources to process the query using the virtual collection to generate the response to the query (Wilson, paragraph [0039], lines 10-12).
Sundaram as modified by Khurana, Petropoulos and Wilson further teach:
the processor is further configured to: receive, through the communication network, from the client device, a second query on second target data in a first virtual collection (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake);
transmit, through the communication network, to a data storage system associated with the first virtual collection, information indicating the second query (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake; if the query can be divided, the query service node spreads the work across multiple query service nodes; the result(s) from each of the query service node(s) are aggregated to provide a set of results in response to the user query);
receive, through the communication network, from the data storage system, second response data obtained from executing the second query (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake; if the query can be divided, the query service node spreads the work across multiple query service nodes; the result(s) from each of the query service node(s) are aggregated to provide a set of results in response to the user query);
and transmit, through the communication network, to the client device, the second response data (Wilson, paragraph [0009], virtual ‘collections’ of distributed data objects can be specified and queried in a manner that is directly analogous to querying collections in a document database system; also see paragraph [0127], a query comes in from a user to a query service node; the query service node determines the type of query is a data lake query, and determines the virtual collection for the data lake; if the query can be divided, the query service node spreads the work across multiple query service nodes; the result(s) from each of the query service node(s) are aggregated to provide a set of results in response to the user query; also see paragraph [0124], the query service nodes can divide and conquer a query by processing data from the files within the customer S3 buckets; the results of the query service nodes can be merged to create a result set; the result set can be returned to the user).
Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), and Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1), further in view of Rahle (U.S. Pub. No. 2022/0067021 A1).
Regarding claim 23, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose: wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition; and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion the response data.
Rahle teaches: wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition (Fig. 2, paragraph [0004], the system partitions time-based data items into segments based on one or more metadata criteria (e.g., product, status, deployment, environment, version, host, etc.); the system may then group the data items by time intervals, and calculate one or more statistical attributes for each of the segments within each of the time windows; this statistical data may be stored in association with the corresponding segment and time interval for access by one or more front end software applications); and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion the response data (paragraph [0019], the system allows efficient identification [querying] and presentation [displaying] of statistical information for a particular time period from a large set of temporally ordered events; also see paragraph [0040], using the data insight system and method disclosed herein, statistical attributes for that same 1 minute time window that are stored in a data insights database may be quickly and efficiently accessed).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition; and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion the response data into the data lake system of Sundaram.
Motivation to do so would be to include wherein the partition index stores statistics for data stored in the data partition, the statistics determined based on field values stored in the partition; and executing the query on the identified at least one data partition to obtain the response data comprises reading statistics stored in an entry associated with the data partition as at least a portion the response data such that the system allows efficient identification [querying] and presentation [displaying] of statistical information for a particular time period from a large set of temporally ordered events (Rahle, paragraph [0019]).
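For illustration only, the use of per-partition statistics discussed above for claim 23 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Rahle or of the claimed invention: build_partition_index and max_from_index are hypothetical names, partitions are modeled as non-empty lists of dictionaries, and the stored statistics are assumed to be a count, a minimum and a maximum of one field.

```python
def build_partition_index(partitions, field_name):
    """Compute count/min/max of `field_name` for each non-empty partition."""
    index = {}
    for partition_id, records in partitions.items():
        values = [record[field_name] for record in records]
        index[partition_id] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
        }
    return index


def max_from_index(index):
    """Answer a MAX query from the index entries alone.

    No partition data is read here; only the stored statistics are used.
    """
    return max(entry["max"] for entry in index.values())
```

Once the index is built, an aggregate such as the maximum can be returned from the index entries, so the stored statistics themselves form the response data.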
Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over Sundaram et al. (U.S. Pub. No. 2021/0232604 A1) in view of Khurana et al. (U.S. Patent No. 12,282,479 B2), and Petropoulos et al. (U.S. Pub. No. 2018/0285418 A1), further in view of Barber et al. (U.S. Pub. No. 2013/0325901 A1).
Regarding claim 24, Sundaram as modified by Khurana, and Petropoulos teach all claimed limitations as set forth in rejection of claim 1, but do not explicitly disclose a first portion of the data in a first storage format from the plurality of data objects to optimize queries expected on the first portion of the data; a second portion of the data from the plurality of data objects in a second storage format different from the first storage format to optimize queries expected on the first portion of the data.
Barber teaches: a first portion of the data in a first storage format from the plurality of data objects to optimize queries expected on the first portion of the data; a second portion of the data from the plurality of data objects in a second storage format different from the first storage format to optimize queries expected on the first portion of the data (paragraph [0043], assign the data values 120 to partition based on a representation or encoding format; assigns the data values to a first partition and a second partition 118; the first partition include data values encoded to a first encoding format and the second partition may include unencoded data value; assign data value to third partition which includes data values encoded in second encoding format).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a first portion of the data in a first storage format from the plurality of data objects to optimize queries expected on the first portion of the data; a second portion of the data from the plurality of data objects in a second storage format different from the first storage format to optimize queries expected on the first portion of the data into the data lake system of Sundaram.
Motivation to do so would be to include a first portion of the data in a first storage format from the plurality of data objects to optimize queries expected on the first portion of the data; a second portion of the data from the plurality of data objects in a second storage format different from the first storage format to optimize queries expected on the first portion of the data to overcome the issue that a particular encoding configuration can have a negative impact on query performance (Barber, paragraph [0003], lines 10-11).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEN HOANG whose telephone number is (571)272-8401. The examiner can normally be reached M-F 7:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Charles Rones can be reached at (571)272-4085. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KEN HOANG/ Examiner, Art Unit 2168
/CHARLES RONES/ Supervisory Patent Examiner, Art Unit 2168