The need for editing historical data
Editing historical data can be necessary for several reasons:- Error correction: Mistakes in data entry or processing can lead to inaccuracies. Correcting these errors ensures the data reflects true events or values.
- Compliance and regulations: Certain industries have strict regulations that may require updates to historical data to remain compliant, especially if the data impacts financial reporting or legal obligations.
- Improved analysis: As new insights or methodologies emerge, historical data may need to be adjusted to ensure analyses are relevant and accurate, allowing for better decision-making.
- Data quality enhancement: Over time, organizations may refine their data quality standards. Updating historical records can help align them with current standards.
- Business strategy changes: Shifts in business strategy might necessitate alterations in historical data for better reporting or strategic analysis.
🚧 Warning Editing historical data in a database should be approached with caution, as it can affect data integrity and accuracy.
How to edit historical data
How you edit the data depends on how the data was ingested by Zenskar:- If you used usage event APIs to send data to Zenskar, refer to the section Edit historical data using usage-event APIs.
- If you used data-source connectors, refer to the following sections conditionally:
Edit historical data using usage-event APIs
Edit historical data when billable metrics are derived using the COUNT SQL aggregate function
In this example, you rent out your office space on a per day basis. The following data schema is used for sending usage events to Zenskar:
JSON
| data.Id | data.was_office_space_used | timestamp | customer_id |
|---|---|---|---|
| c01 | Yes | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | Yes | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | Yes | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
COUNT aggregate function on the data.was_office_space_used column to calculate the billable metric.
Breakdown of the SQL query
SELECT clause:
COUNT: This function counts non-null values.CASEstatement: checks if the columndata.was_office_space_usedhas the valueYes. If it does, it returns 1; otherwise, it returns null.COALESCE: this function returns the first non-null value in the list. If there are no instances wheredata.was_office_space_usedisYes,COUNTwould return0, andCOALESCEensures that it is treated as such.
FROM clause:
your_table with the actual name of the table containing your data.
WHERE Clause:
DATE("timestamp"): This converts the timestamp field to a date format, ensuring the comparison works with date values.DATE({{start_date}}) and DATE({{end_date}}): These placeholders represent the start and end dates for filtering.customer_id = CAST({{customer.external_id}} AS String): This checks if thecustomer_idin the table matches the providedexternal_id, which is also cast to a string to ensure type compatibility.
Method 1: modify the data schema by introducing a new data field
🔖 Note The examples given in this document are for reference only. The ideas given herein are guidelines and not rules. You must adapt the ideas given in this document as per your use case.You realize that an error was made in data entry. To correct the error, you decide to introduce a new field in the data schema. In this example, we will use a
boolean type.
JSON
| data.Id | data.was_office_space_used | data.deduct_from_count | timestamp | customer_id |
|---|---|---|---|---|
| c01 | Yes | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 | |
| c02 | Yes | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 | |
| c03 | Yes | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 | |
| c04 | No | True | 2023-05-01T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
data.was_office_space_used column by taking into account the count of the newly introduced data.deduct_from_count column:
Breakdown of the SQL query
SELECT clause:
COALESCE(...): This function returns the first non-null value in its list of arguments. Here, it is used to ensure that if the result of the subtraction is null, which can happen if there are no rows, it will return0instead.COUNT(data.was_office_space_used): This counts the number of non-null entries in thedata.was_office_space_usedcolumn.COUNT(NULLIF(data.deduct_from_count, '')): TheNULLIFfunction returnsNULLifdata.deduct_from_countis an empty string (”), effectively counting only non-empty entries. So, this counts the number of non-empty entries indata.deduct_from_count. The entire expression calculates the difference between the count ofdata.was_office_space_usedand the count of non-emptydata.deduct_from_count.
FROM clause specifies the table from which the data is being queried.
WHERE clause:
DATE("timestamp") >= DATE({{start_date}}) AND DATE("timestamp") <= DATE({{end_date}}): this filters the results to include only those records where the timestamp is within the specified date range, defined by the variables{{start_date}}and{{end_date}}."customer_id" = CAST({{customer.external_id}} AS String): This filters the results to include only rows where thecustomer_idmatches theexternal_id, after convertingexternal_idto a string.
Method 2: introduce a new usage event for deduction
Let us assume that you used the following data schema for sending usage events:JSON
| data.Id | data.Premises_used | timestamp | customer_id |
|---|---|---|---|
| c01 | Yes | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | Yes | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | Yes | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
JSON
| data.Id | data.Error | timestamp | customer_id |
|---|---|---|---|
| c01 | Yes | 2023-05-15T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | Yes | 2023-05-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
Edit historical data when billable metrics are derived using the SUM aggregate function
Let us assume that you used the following data schema for sending usage events:
JSON
| data.Id | data.Usage | timestamp | customer_id |
|---|---|---|---|
| c01 | 301.4 | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | 500 | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | 104.8 | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
data.Id equaling c02 should be 475 and not 500. In this case, you can make another usage event API call to add another unique data.Id with a data.Usage of -25, as shown below.
| data.Id | data.Usage | timestamp | customer_id |
|---|---|---|---|
| c01 | 301.4 | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | 500 | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | 104.8 | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c04 | -25 | 2023-05-01T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
SUM function, the negative entry in the data.Usage column will account for the data error.
Edit historical data when billable metrics are derived using the MAX, AVG, MIN, and UNIQUE COUNT aggregate functions
Techniques similar to the ones mentioned for COUNT and SUM aggregate functions can be used for MAX, AVG, MIN, and UNIQUE COUNT` aggregate functions.
Editing historical data using ROW_NUMBER() OVER (PARTITION BY) SQL function
The ROW_NUMBER() function in SQL is a window function that assigns a unique sequential integer to rows within a partition of a result set. It is often used to uniquely identify rows within groups of data. The PARTITION BY clause is used to define how the rows are divided into groups.
Breakdown of ROW_NUMBER() OVER (PARTITION BY)
ROW_NUMBER(): generates a unique number for each row in the result set, starting from 1 for the first row in each partition.OVER: specifies the window over which the function operates. It can include:PARTITION BY: defines the groups (partitions) within the data. Each partition is treated independently when generating row numbers.
| data.Id | data.Usage | timestamp | customer_id |
|---|---|---|---|
| c01 | 301.4 | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | 500 | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | 104.8 | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
data.Id equaling c02, data.Usage should be 475 and not 500. In this case, you can make another usage event API call to add a row with data.Id equaling c02, data.Usage of 475, and latest timestamp, as shown below.
| data.Id | data.Usage | timestamp | customer_id |
|---|---|---|---|
| c01 | 301.4 | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | 500 | 2023-04-29T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | 475 | 2023-05-01T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | 104.8 | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| data.Id | data.Usage | timestamp | customer_id |
|---|---|---|---|
| c01 | 301.4 | 2023-04-28T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c02 | 475 | 2023-05-01T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |
| c03 | 104.8 | 2023-04-30T13:26:05.017000 | acc93335-aabb-43e9-aabb-138ac880b715 |