From d3a84cac1b77f62c4fb6cbf394b809bb286bdc0e Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Mon, 6 Oct 2025 10:37:00 +0100 Subject: [PATCH 1/9] Adds line breaks to logging policy Makes it easier to read/edit the markdown. --- .../Logging/LoggingPolicy.md | 180 +++++++++++++----- 1 file changed, 136 insertions(+), 44 deletions(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 2433ae39..c4446056 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -1,39 +1,66 @@ # Logging Policy -This policy aims to provide guidance to software engineering teams on how the services they develop or support should be logging. Given our logging repository of choice is ElasticSearch, some of the guidance will be influenced by that service. +This policy aims to provide guidance to software engineering teams on how the +services they develop or support should be logging. Given our logging +repository of choice is ElasticSearch, some of the guidance will be influenced +by that service. ## Disclaimer -Not all systems will be using or able to use Elastic logging due to environemntal issues. In these cases the policy is not relevant. +Not all systems will be using or able to use Elastic logging due to +environemntal issues. In these cases the policy is not relevant. ## Log types and levels -This section will cover the different log types that services are expected to produce, what log levels they should be at, how they should be indexed in ElasticSearch, retention periods and any other related guidance. +This section will cover the different log types that services are expected to +produce, what log levels they should be at, how they should be indexed in +ElasticSearch, retention periods and any other related guidance. -Services can produce logs that fall into three main categories: diagnostic, audit, and request/response. 
Each type will have an expected level and ElasticSearch index pattern that is named according to the following convention: **FullServiceName-environment-category**. +Services can produce logs that fall into three main categories: diagnostic, +audit, and request/response. Each type will have an expected level and +ElasticSearch index pattern that is named according to the following +convention: **FullServiceName-environment-category**. ### Diagnostic logs - Error, warning logs - Expected minimum log levels for Production: Warning or Error -- Should ingest into a dedicated ElasticSearch index named as e.g., SalesCatalogueService-Dev1-Diagnostic + +- Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-Diagnostic` ### Audit/Metric logs -- Expected minimum log levels for Production: Information, but may differ based on the individual type of audit or metric log -- Should ingest into a dedicated ElasticSearch index named as e.g., SalesCatalogueService-Dev1-Audit +- Expected minimum log levels for Production: Information, but may differ based +on the individual type of audit or metric log + +- Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-Audit` ### Request/response logging - Expected minimum log levels for Production: Information -- Should ingest into a dedicated ElasticSearch index named as e.g., SalesCatalogueService-Dev1-HTTP + +- Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-HTTP` ### Further related guidance -- Healthcheck logging is not essential and can cause log saturation. We should prefer instead to use healthcheck endpoints that can be polled. -- Consider the value of logs – e.g., we don’t need to log that an Azure Event Hub is in existence and healthy. -- Avoid logging binary data to ElasticSearch. We should instead log a reference to the object that is stored in blob storage. 
-- Consideration needs to be given to what is included in logs. Teams should avoid logging many different properties in the hope that they will then have captured everything. -- On-premise services need to have known mitigations for failures that may occur upon trying to ingest logs to Elastic (e.g. log to EventViewer, send an email notification to supporting team). +- Healthcheck logging is not essential and can cause log saturation. We should +prefer instead to use healthcheck endpoints that can be polled. + +- Consider the value of logs – e.g., we don’t need to log that an Azure Event +Hub is in existence and healthy. + +- Avoid logging binary data to ElasticSearch. We should instead log a reference +to the object that is stored in blob storage. + +- Consideration needs to be given to what is included in logs. Teams should +avoid logging many different properties in the hope that they will then have +captured everything. + +- On-premise services need to have known mitigations for failures that may +occur upon trying to ingest logs to Elastic (e.g. log to EventViewer, send an +email notification to supporting team). *** @@ -41,20 +68,33 @@ Services can produce logs that fall into three main categories: diagnostic, audi ### Elastic Cloud -By default, services that are using the UKHO’s Elastic Cloud instance will come under the following retention policy: +By default, services that are using the UKHO’s Elastic Cloud instance will come +under the following retention policy: + +- Logs will remain in the "Hot" phase for 7 days. This tier provides the best +indexing and search performance. + +- After 7 days, logs will move to the "Warm" tier. This tier is optimal for +data that is still likely to be searched, but infrequently updated. + +- After 30 days, logs will move to the "Cold" tier. This tier is used when +searching data less often and where we don’t need to update it. + +- After 90 days, logs will move to the "Frozen" tier for a longer term +retention. 
This is the most cost-effective way to store data and still be able +to search it. -- Logs will remain in the "Hot" phase for 7 days. This tier provides the best indexing and search performance. -- After 7 days, logs will move to the "Warm" tier. This tier is optimal for data that is still likely to be searched, but infrequently updated. -- After 30 days, logs will move to the "Cold" tier. This tier is used when searching data less often and where we don’t need to update it. -- After 90 days, logs will move to the "Frozen" tier for a longer term retention. This is the most cost-effective way to store data and still be able to search it. - After 365 days, logs will be deleted. ### On-premise ElasticSearch (legacy) Currently the retention policy for logs differs by Elastic Search instance: -- Engineering – once indexes are created, they are subject to a 90 day deletion policy. -- Live – logs are retained indefinitely. Although a policy exists that will delete logs after a year, this is only used by one service. +- Engineering – once indexes are created, they are subject to a 90 day deletion +policy. + +- Live – logs are retained indefinitely. Although a policy exists that will +delete logs after a year, this is only used by one service. *** @@ -62,54 +102,89 @@ Currently the retention policy for logs differs by Elastic Search instance: This section will cover how to test the logging implemented by a service. 
-Teams should look to leverage their definitions of Ready and Done to drive their logging practices: +Teams should look to leverage their definitions of Ready and Done to drive +their logging practices: **Definition of Ready** to include: -- Agreeing a common field or value, for example a TraceId or CorrelationId, and how to test for this property across logs +- Agreeing a common field or value, for example a TraceId or CorrelationId, and +how to test for this property across logs + - A consideration of the different log types and how to develop towards that -- What to log where -- The 'standard' EventHubLogProvider (https://github.com/UKHO/EventHub-Logging-Provider/tree/main/UKHO.Logging.EventHubLogProvider) should be used by default (unless there is good reason not to). If a different variant of logging is required consider extending the EventHubLogProvider to keep logging as standard as possible across all applications. -- or the *new* direct 'serilog' logging provider (https://github.com/UKHO/UKHO.Logging.Serilog) which has the same standards. - + +### What to log where + +- The 'standard' EventHubLogProvider +(https://github.com/UKHO/EventHub-Logging-Provider/tree/main/UKHO.Logging.EventHubLogProvider) +should be used by default (unless there is good reason not to). If a different +variant of logging is required consider extending the EventHubLogProvider to +keep logging as standard as possible across all applications. + +- or the *new* direct 'serilog' logging provider +(https://github.com/UKHO/UKHO.Logging.Serilog) which has the same standards. + **Definition of Done** to include: - Ensure log levels are correct for each environment ### Test Approach and TSR documents -Teams will be expected to demonstrate observability for the service and prove this is working as expected. +Teams will be expected to demonstrate observability for the service and prove +this is working as expected. 
### Support team handovers -When receiving a service, support/CI teams will ensure that good logging practice has been adhered to, and this should be demonstrated. +When receiving a service, support/CI teams will ensure that good logging +practice has been adhered to, and this should be demonstrated. ### Smoke test/monitor log ingestion -Teams should ensure when creating logs in Elastic that they have ingested successfully to the correct index and are discoverable. This could be via an automated test or a manual check. +Teams should ensure when creating logs in Elastic that they have ingested +successfully to the correct index and are discoverable. This could be via an +automated test or a manual check. -If teams are using the legacy Azure Event Hub > LogStash > on-premise ElasticSearch pattern for ingesting logs, they must check LogStash for errors at every stage of the development process, using the DDC Grafana monitor set up for this purpose. +If teams are using the legacy Azure Event Hub > LogStash > on-premise +ElasticSearch pattern for ingesting logs, they must check LogStash for errors +at every stage of the development process, using the DDC Grafana monitor set up +for this purpose. ### Unit tests -Logging is a first class citizen when it comes to unit testing. At a code level, unit tests should assert that logs are logging to the level they should be. +Logging is a first class citizen when it comes to unit testing. At a code +level, unit tests should assert that logs are logging to the level they should +be. -In later environments as log levels become more restrictive, teams should test that the correct log levels are being used in accordance with the environment. +In later environments as log levels become more restrictive, teams should test +that the correct log levels are being used in accordance with the environment. 
### Load testing -Load tests should be set to Production level logging, so the capacity generated from logs targeting Production is understood. The load testing environment should be as live-like as possible. +Load tests should be set to Production level logging, so the capacity generated +from logs targeting Production is understood. The load testing environment +should be as live-like as possible. *** ## Security -When implementing logging into a solution, it is essential to consider the following secure design practices: +When implementing logging into a solution, it is essential to consider the +following secure design practices: + +- Encode and validate any dangerous inputs before storing the log to prevent +[log injection](https://owasp.org/www-community/attacks/Log_Injection) or log +forging attacks. + +- Ensure that no sensitive information gets stored in logs, for example, +passwords, secret keys, and session IDs. -- Encode and validate any dangerous inputs before storing the log to prevent [log injection](https://owasp.org/www-community/attacks/Log_Injection) or log forging attacks. -- Ensure that no sensitive information gets stored in logs, for example, passwords, secret keys, and session IDs. -- Forward any logs to a centralised, secure logging system that implements a proper failover system. A load-balanced logging system will ensure that no log data is lost if a node is compromised. -- Protect log integrity by ensuring that log files cannot be tampered with, as a malicious attacker usually carries this out to cover up an attack. You can confirm this by implementing proper user permissions and logging into an immutable data store (such as Kibana). +- Forward any logs to a centralised, secure logging system that implements a +proper failover system. A load-balanced logging system will ensure that no log +data is lost if a node is compromised. 
+ +- Protect log integrity by ensuring that log files cannot be tampered with, as +a malicious attacker usually carries this out to cover up an attack. You can +confirm this by implementing proper user permissions and logging into an +immutable data store (such as Kibana). *** @@ -117,26 +192,43 @@ When implementing logging into a solution, it is essential to consider the follo ### Cloud services - Elastic Cloud *(not yet available)* -Services held in Azure will log to an Azure Event Hub, with a separate storage account container used to track the processing of logs via a pointer. In Elastic Cloud, an Elastic Agent policy will have an integration defined that links to the Event Hub and storage account. Elastic Agent will process the logs so that they are more readable and searchable in Kibana, and will set up the necessary indexes and index lifecycle management (as per the Elastic Cloud retention details above). Your DDC resource can help with this set up. +Services held in Azure will log to an Azure Event Hub, with a separate storage +account container used to track the processing of logs via a pointer. In +Elastic Cloud, an Elastic Agent policy will have an integration defined that +links to the Event Hub and storage account. Elastic Agent will process the logs +so that they are more readable and searchable in Kibana, and will set up the +necessary indexes and index lifecycle management (as per the Elastic Cloud +retention details above). Your DDC resource can help with this set up. #### Cloud native logs -Cloud resource specific logging, such as native activity or diagnostic, can be used by teams where beneficial. These logs aren't usually ingested in Elastic, and teams need to keep a close eye on the costs associated with using them. +Cloud resource specific logging, such as native activity or diagnostic, can be +used by teams where beneficial. 
These logs aren't usually ingested in Elastic, +and teams need to keep a close eye on the costs associated with using them. #### Legacy - LogStash and on-premise ElasticSearch -An on-premise LogStash instance is responsible for ingesting and mutating logs from Azure Event Hub into on-premise ElasticSearch. This pattern is considered legacy and should only continue to be adopted whilst Elastic Cloud is not yet available. +An on-premise LogStash instance is responsible for ingesting and mutating logs +from Azure Event Hub into on-premise ElasticSearch. This pattern is considered +legacy and should only continue to be adopted whilst Elastic Cloud is not yet +available. ### On-premise services -On-premise services should use a log aggregator and shipper such as Elastic Filebeat, or the Serilog ElasticSearch sink, to ingest logs into Elastic Cloud. +On-premise services should use a log aggregator and shipper such as Elastic +Filebeat, or the Serilog ElasticSearch sink, to ingest logs into Elastic Cloud. #### Legacy - LogShipper and on-premise ElasticSearch -Using LogShipper to ingest logs from on-premise services to on-premise ElasticSearch is considered legacy and should no longer be adopted. +Using LogShipper to ingest logs from on-premise services to on-premise +ElasticSearch is considered legacy and should no longer be adopted. *** ## Migrating services to Elastic Cloud -Existing services that are currently using the legacy Azure Event Hub > LogStash > on-premise ElasticSearch pattern should have no errors (and no warnings) in LogStash before they are migrated to Elastic Cloud. Furthermore, these services should adhere to the logging policy before migration. +Existing services that are currently using the legacy Azure Event Hub > +LogStash > on-premise ElasticSearch pattern should have no errors (and no +warnings) in LogStash before they are migrated to Elastic Cloud. Furthermore, +these services should adhere to the logging policy before migration. 
+ From 0e07d1eb54047b0cf754e4d9cecc6ba7af0bba48 Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Mon, 6 Oct 2025 11:02:22 +0100 Subject: [PATCH 2/9] Updates logging policy 1) Elastic Cloud is live and should be how everybody logs 2) UKHO.Logging.Serilog works. --- .../Logging/LoggingPolicy.md | 45 ++++++++----------- 1 file changed, 19 insertions(+), 26 deletions(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index c4446056..7b68102f 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -114,14 +114,10 @@ how to test for this property across logs ### What to log where -- The 'standard' EventHubLogProvider -(https://github.com/UKHO/EventHub-Logging-Provider/tree/main/UKHO.Logging.EventHubLogProvider) -should be used by default (unless there is good reason not to). If a different -variant of logging is required consider extending the EventHubLogProvider to -keep logging as standard as possible across all applications. - -- or the *new* direct 'serilog' logging provider -(https://github.com/UKHO/UKHO.Logging.Serilog) which has the same standards. +Teams should use the [UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) +package to implement logging in their services. This package provides a +standardised way of logging across services, and provides a number of built-in +log enrichments that will be useful for searching and filtering logs in Elastic. **Definition of Done** to include: @@ -190,15 +186,15 @@ immutable data store (such as Kibana). ## Available log ingestion patterns -### Cloud services - Elastic Cloud *(not yet available)* +### Cloud services - Elastic Cloud + +Services held in Azure log to an Azure Event Hub. 
In Elastic Cloud, a managed +Elastic Agent policy has an integration that pulls from all Event Hubs, using a +dedicated storage account container to track the processing of logs. -Services held in Azure will log to an Azure Event Hub, with a separate storage -account container used to track the processing of logs via a pointer. In -Elastic Cloud, an Elastic Agent policy will have an integration defined that -links to the Event Hub and storage account. Elastic Agent will process the logs -so that they are more readable and searchable in Kibana, and will set up the -necessary indexes and index lifecycle management (as per the Elastic Cloud -retention details above). Your DDC resource can help with this set up. +An automated process discovers new Event Hubs, adds them to the Elastic Agent +policy, and sets up necessary indexes and index lifecycle management (as per +the Elastic Cloud retention details above). #### Cloud native logs @@ -206,22 +202,19 @@ Cloud resource specific logging, such as native activity or diagnostic, can be used by teams where beneficial. These logs aren't usually ingested in Elastic, and teams need to keep a close eye on the costs associated with using them. -#### Legacy - LogStash and on-premise ElasticSearch - -An on-premise LogStash instance is responsible for ingesting and mutating logs -from Azure Event Hub into on-premise ElasticSearch. This pattern is considered -legacy and should only continue to be adopted whilst Elastic Cloud is not yet -available. - ### On-premise services -On-premise services should use a log aggregator and shipper such as Elastic -Filebeat, or the Serilog ElasticSearch sink, to ingest logs into Elastic Cloud. +On-premise services should also log to an Azure Event Hub as a step towards +becoming cloud services. The +[UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) package +provides support for logging to Event Hubs. 
#### Legacy - LogShipper and on-premise ElasticSearch Using LogShipper to ingest logs from on-premise services to on-premise -ElasticSearch is considered legacy and should no longer be adopted. +ElasticSearch has been depreciated and should no longer be used in new +code. Existing projects using this pattern should look to migrate to Elastic Cloud +as soon as possible. *** From 15ebdf23ca5f2cf1df5a8702e6c1874981a63fdb Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Tue, 7 Oct 2025 10:07:55 +0100 Subject: [PATCH 3/9] Fixes spelling error --- software-engineering-policies/Logging/LoggingPolicy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 7b68102f..6c2be90a 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -8,7 +8,7 @@ by that service. ## Disclaimer Not all systems will be using or able to use Elastic logging due to -environemntal issues. In these cases the policy is not relevant. +environmental issues. In these cases the policy is not relevant. ## Log types and levels From 4d08a3d1645ad749d44167d01d6c41efdb32ced3 Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Tue, 7 Oct 2025 10:08:19 +0100 Subject: [PATCH 4/9] Updates and clarifies log retention policy --- .../Logging/LoggingPolicy.md | 31 +++++-------------- 1 file changed, 7 insertions(+), 24 deletions(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 6c2be90a..e200c003 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -66,35 +66,18 @@ email notification to supporting team). 
## Retention -### Elastic Cloud +Data logged to Elastic Cloud comes under the following retention policy: -By default, services that are using the UKHO’s Elastic Cloud instance will come -under the following retention policy: - -- Logs will remain in the "Hot" phase for 7 days. This tier provides the best +- Logs are ingested into the "Hot" tier. This tier provides the best indexing and search performance. -- After 7 days, logs will move to the "Warm" tier. This tier is optimal for -data that is still likely to be searched, but infrequently updated. - -- After 30 days, logs will move to the "Cold" tier. This tier is used when -searching data less often and where we don’t need to update it. - -- After 90 days, logs will move to the "Frozen" tier for a longer term -retention. This is the most cost-effective way to store data and still be able -to search it. - -- After 365 days, logs will be deleted. - -### On-premise ElasticSearch (legacy) - -Currently the retention policy for logs differs by Elastic Search instance: +- After **2 days** logs are automatically moved to the "Cold" tier. This tier +is optimal for data that is still likely to be searched, but infrequently +- updated. -- Engineering – once indexes are created, they are subject to a 90 day deletion -policy. +- After **7 days**, logs in **non-live** are deleted, and can not be recovered. -- Live – logs are retained indefinitely. Although a policy exists that will -delete logs after a year, this is only used by one service. +- After **90 days**, logs in **live** are deleted, and can not be recovered. 
 *** From 93bc0188bf10033bf8d8a4755b091ce8845aa8a4 Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Mon, 13 Oct 2025 10:12:50 +0100 Subject: [PATCH 5/9] Splits LoggingPolicy.md into policy and elastic sections --- .../Logging/LoggingPolicy.md | 259 ++++++++++-------- 1 file changed, 144 insertions(+), 115 deletions(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index e200c003..1e7e18cf 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -1,173 +1,193 @@ # Logging Policy -This policy aims to provide guidance to software engineering teams on how the -services they develop or support should be logging. Given our logging -repository of choice is ElasticSearch, some of the guidance will be influenced -by that service. +This policy defines the logging requirements for services developed and +supported by software engineering teams at UKHO. -## Disclaimer +## Contents of Logs -Not all systems will be using or able to use Elastic logging due to -environmental issues. In these cases the policy is not relevant. +Logs must contain the following information: + - projectName (The overarching project that the service belongs to) + - serviceName (The name of an individual service) + - environment (One of "Development", "Test", "PreProduction", "Production") + - level ("Information", "Warning", "Error", "Fatal") + - Information is for regular messages that can be aggregated into a metric + - Warning is for expected problems that the service can recover from + - Error is for unexpected problems that require investigation that should trigger an alert + - Fatal is for situations where the service is broken and requires immediate human intervention + - message (The data being logged. 
This should be a JSON formatted blob) + - traceId (If the process that generates the log was started from a request external to the project, + the traceId should be included to allow the logs to be correlated with the request) -## Log types and levels -This section will cover the different log types that services are expected to -produce, what log levels they should be at, how they should be indexed in -ElasticSearch, retention periods and any other related guidance. +## Log types and levels -Services can produce logs that fall into three main categories: diagnostic, -audit, and request/response. Each type will have an expected level and -ElasticSearch index pattern that is named according to the following -convention: **FullServiceName-environment-category**. +Services should produce logs that fall into three main categories: diagnostic, +audit/metric, and request/response. -### Diagnostic logs - Error, warning logs +### Diagnostic logs -- Expected minimum log levels for Production: Warning or Error +These are logs that provide information about the service's operation and are +generally useful for debugging. Projects need to balance the need to provide +useful information with the risk of logging too much information. -- Should ingest into a dedicated ElasticSearch index named as e.g., -`SalesCatalogueService-Dev1-Diagnostic` +Expected minimum log levels for Production: Warning or Error ### Audit/Metric logs -- Expected minimum log levels for Production: Information, but may differ based -on the individual type of audit or metric log +These are logs that provide information about the service's operation and are +generally useful for auditing and metrics. Where possible these logs should be +numeric or small enums to allow for aggregation. 
 -- Should ingest into a dedicated ElasticSearch index named as e.g., -`SalesCatalogueService-Dev1-Audit` +Expected minimum log levels for Production: Information, but may differ based +on the individual type of audit or metric log ### Request/response logging -- Expected minimum log levels for Production: Information +These are logs that record requests to and responses from a service. These logs +are useful to trace a request through different services and to identify +problems with requests. -- Should ingest into a dedicated ElasticSearch index named as e.g., -`SalesCatalogueService-Dev1-HTTP` +Expected minimum log levels for Production: Information -### Further related guidance +### General logging guidance -- Healthcheck logging is not essential and can cause log saturation. We should -prefer instead to use healthcheck endpoints that can be polled. +- Health check logging is not essential and can cause log saturation. Prefer +using health check endpoints that can be polled. -- Consider the value of logs – e.g., we don’t need to log that an Azure Event -Hub is in existence and healthy. +- Consider the value of logs – avoid logging information that provides no +diagnostic or audit value. -- Avoid logging binary data to ElasticSearch. We should instead log a reference -to the object that is stored in blob storage. +- Consider the GDPR - avoid logging customer information. -- Consideration needs to be given to what is included in logs. Teams should -avoid logging many different properties in the hope that they will then have -captured everything. +- Avoid logging binary data or large files. Log references to objects stored in +appropriate storage instead. + +- Be consistent with data that is being logged across a project. + +- Teams should be selective about what is included in logs. Avoid logging +many different properties in the hope of capturing everything. 
 - On-premise services need to have known mitigations for failures that may -occur upon trying to ingest logs to Elastic (e.g. log to EventViewer, send an -email notification to supporting team). +occur upon trying to ingest logs (e.g., log to EventViewer, send an email +notification to supporting team). -*** +## Security -## Retention +When implementing logging, consider the following secure design practices: -Data logged to Elastic Cloud comes under the following retention policy: +- Use structured logging instead of string concatenation to avoid log injection vulnerabilities. -- Logs are ingested into the "Hot" tier. This tier provides the best -indexing and search performance. +- Ensure that no sensitive information gets stored in logs, for example, + passwords, secret keys, and session IDs. -- After **2 days** logs are automatically moved to the "Cold" tier. This tier -is optimal for data that is still likely to be searched, but infrequently -- updated. +- Ensure that no personal information gets stored in logs, for example, + customer names, addresses, and email addresses. -- After **7 days**, logs in **non-live** are deleted, and can not be recovered. +## Retention -- After **90 days**, logs in **live** are deleted, and can not be recovered. +Log retention is managed by the Observability team and is set to: -*** +- **Non-live environments**: 7 days +- **Live environments**: 90 days -## Testing and assuring logging +Logs are not recoverable after this period. -This section will cover how to test the logging implemented by a service. 
+## Testing and assuring logging -Teams should look to leverage their definitions of Ready and Done to drive -their logging practices: +Teams should leverage their definitions of Ready and Done to drive logging +practices: -**Definition of Ready** to include: +**Definition of Ready** should include: -- Agreeing a common field or value, for example a TraceId or CorrelationId, and -how to test for this property across logs +- A common dictionary for log messages ensuring that values are comparable +across services within a project. - A consideration of the different log types and how to develop towards that -### What to log where +**Definition of Done** should include: + +- Ensure log levels are correct for each environment + +### Implementation Teams should use the [UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) package to implement logging in their services. This package provides a -standardised way of logging across services, and provides a number of built-in -log enrichments that will be useful for searching and filtering logs in Elastic. +standardised way of logging across services and provides built-in log +enrichments. -**Definition of Done** to include: +**Note**: Teams should prefer UKHO.Logging.Serilog over the legacy +UKHO.EventHubLogging provider. Existing projects using the legacy provider +should plan to migrate. -- Ensure log levels are correct for each environment +### Testing requirements -### Test Approach and TSR documents +- **Test Approach and TSR documents**: Teams must demonstrate observability for +the service and prove this is working as expected. -Teams will be expected to demonstrate observability for the service and prove -this is working as expected. +- **Support team handovers**: Support/CI teams will ensure that good logging +practice has been adhered to, and this should be demonstrated. 
-### Support team handovers +- **Smoke test/monitor log ingestion**: Teams should ensure logs have ingested +successfully and are discoverable. This could be via an automated test or a +manual check. -When receiving a service, support/CI teams will ensure that good logging -practice has been adhered to, and this should be demonstrated. +- **Unit tests**: Logging is a first class citizen. Unit tests should assert +that logs are logging to the expected level. -### Smoke test/monitor log ingestion +- **Load testing**: Load tests should use Production level logging, so the +capacity generated from logs targeting Production is understood. The load +testing environment should be as live-like as possible. -Teams should ensure when creating logs in Elastic that they have ingested -successfully to the correct index and are discoverable. This could be via an -automated test or a manual check. -If teams are using the legacy Azure Event Hub > LogStash > on-premise -ElasticSearch pattern for ingesting logs, they must check LogStash for errors -at every stage of the development process, using the DDC Grafana monitor set up -for this purpose. +*** -### Unit tests +# Guidance for Elastic -Logging is a first class citizen when it comes to unit testing. At a code -level, unit tests should assert that logs are logging to the level they should -be. +This section provides technical guidance for implementing the Logging Policy +using ElasticSearch and Elastic Cloud. -In later environments as log levels become more restrictive, teams should test -that the correct log levels are being used in accordance with the environment. +## Technology Choice -### Load testing +Elastic is the tool of choice for log aggregation and analysis. If there are +technical considerations which prevent use of Elastic, consider appropriate +alternatives and detail how these make a best effort to meet this policy in +your design documentation. 
-Load tests should be set to Production level logging, so the capacity generated -from logs targeting Production is understood. The load testing environment -should be as live-like as possible. +## Index naming convention -*** +Logs should be indexed in ElasticSearch according to the following convention: +**FullServiceName-environment-category** -## Security +### Diagnostic logs -When implementing logging into a solution, it is essential to consider the -following secure design practices: +Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-Diagnostic` -- Encode and validate any dangerous inputs before storing the log to prevent -[log injection](https://owasp.org/www-community/attacks/Log_Injection) or log -forging attacks. +### Audit/Metric logs -- Ensure that no sensitive information gets stored in logs, for example, -passwords, secret keys, and session IDs. +Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-Audit` + +### Request/response logging -- Forward any logs to a centralised, secure logging system that implements a -proper failover system. A load-balanced logging system will ensure that no log -data is lost if a node is compromised. +Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-HTTP` -- Protect log integrity by ensuring that log files cannot be tampered with, as -a malicious attacker usually carries this out to cover up an attack. You can -confirm this by implementing proper user permissions and logging into an -immutable data store (such as Kibana). +## Retention implementation in Elastic Cloud -*** +Data logged to Elastic Cloud follows this retention implementation: -## Available log ingestion patterns +- Logs are ingested into the "Hot" tier for best indexing and search +performance. 
+ +- After **2 days** logs are automatically moved to the "Cold" tier, optimal for +data that is still likely to be searched but infrequently updated. + +- After **7 days**, logs in **non-live** are deleted and cannot be recovered. + +- After **90 days**, logs in **live** are deleted and cannot be recovered. + +## Log ingestion patterns ### Cloud services - Elastic Cloud @@ -176,8 +196,7 @@ Elastic Agent policy has an integration that pulls from all Event Hubs, using a dedicated storage account container to track the processing of logs. An automated process discovers new Event Hubs, adds them to the Elastic Agent -policy, and sets up necessary indexes and index lifecycle management (as per -the Elastic Cloud retention details above). +policy, and sets up necessary indexes and index lifecycle management. #### Cloud native logs @@ -192,19 +211,29 @@ becoming cloud services. The [UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) package provides support for logging to Event Hubs. -#### Legacy - LogShipper and on-premise ElasticSearch +## Legacy patterns and migration + +### Legacy - LogShipper and on-premise ElasticSearch Using LogShipper to ingest logs from on-premise services to on-premise -ElasticSearch has been depreciated and should no longer be used in new -code. Existing projects using this pattern should look to migrate to Elastic Cloud -as soon as possible. +ElasticSearch has been deprecated and should no longer be used in new code. +Existing projects using this pattern should look to migrate to Elastic Cloud as +soon as possible. -*** +### Legacy - UKHO.EventHubLogging provider -## Migrating services to Elastic Cloud +The UKHO.EventHubLogging provider has been superseded by UKHO.Logging.Serilog. +Teams using the legacy provider should plan their migration path. 
-Existing services that are currently using the legacy Azure Event Hub > -LogStash > on-premise ElasticSearch pattern should have no errors (and no -warnings) in LogStash before they are migrated to Elastic Cloud. Furthermore, -these services should adhere to the logging policy before migration. +### Migrating services to Elastic Cloud + +Existing services currently using the legacy Azure Event Hub > LogStash > +on-premise ElasticSearch pattern should have no errors (and no warnings) in +LogStash before they are migrated to Elastic Cloud. + +If teams are using the legacy Azure Event Hub > LogStash > on-premise +ElasticSearch pattern for ingesting logs, they must check LogStash for errors +at every stage of the development process, using the DDC Grafana monitor set up +for this purpose. +Services should adhere to the logging policy before migration. From 5317db34403e8ce4d17c6b77b293abe54298fd7a Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Fri, 17 Oct 2025 09:05:38 +0100 Subject: [PATCH 6/9] Corrects GRDP => GDPR --- software-engineering-policies/Logging/LoggingPolicy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 1e7e18cf..670c98ba 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -57,7 +57,7 @@ using health check endpoints that can be polled. - Consider the value of logs – avoid logging information that provides no diagnostic or audit value. -- Consider the GPDR - avoid logging customer information. +- Consider the GDPR - avoid logging customer information. - Avoid logging binary data or large files. Log references to objects stored in appropriate storage instead. 
From ba3c8aced1a8ae0004b49ab5d7b52d98a7f05611 Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Fri, 17 Oct 2025 09:06:11 +0100 Subject: [PATCH 7/9] Stiffens the requirement to use Elastic --- software-engineering-policies/Logging/LoggingPolicy.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 670c98ba..3948d3a7 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -148,10 +148,9 @@ using ElasticSearch and Elastic Cloud. ## Technology Choice -Elastic is the tool of choice for log aggregation and analysis. If there are -technical considerations which prevent use of Elastic, consider appropriate -alternatives and detail how these make a best effort to meet this policy in -your design documentation. +Elastic is the business's logging platform. There can be good reasons to select +other logging platforms, however approval from the Observability team and +Architectural Practice Forum is required in these cases. ## Index naming convention From 145306a76b53baa4302cc6b044ec79f086f8c51b Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Mon, 20 Oct 2025 08:32:09 +0100 Subject: [PATCH 8/9] Updates logging policy based on AI suggestions. --- .../Logging/LoggingPolicy.md | 66 ++++++++++--------- 1 file changed, 34 insertions(+), 32 deletions(-) diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 3948d3a7..6d2f5bea 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -6,18 +6,18 @@ supported by software engineering teams at UKHO. 
## Contents of Logs Logs must contain the following information: - - projectName (The overarching project that the service belongs to) - - serviceName (The name of an individual service) - - environment (One of "Development", "Test", "PreProduction", "Production") - - level ("Information", "Warning", "Error", "Fatal") + +- projectName (The overarching project that the service belongs to) +- serviceName (The name of an individual service) +- environment (One of "Development", "Test", "PreProduction", "Production") +- level ("Information", "Warning", "Error", "Fatal") - Information is for regular messages that can be aggregated into a metric - Warning is for expected problems that the service can recover from - Error is for for unexpected problems that require investigation that should trigger an alert - Fatal is for situations where the service is broken and requires immediate human intervention - - message (The data being logged. This should be a JSON formatted blob) - - traceId (If the process that generates the log was started from a request external to the project, - the traceId should be included to allow the logs to be correlated with the request) - +- message (The log message from the application) +- traceId (OpenTelemetry format trace id, if available. Sometimes called a + "correlation id") ## Log types and levels @@ -45,31 +45,32 @@ on the individual type of audit or metric log These are logs that record requests to and responses from a service. These logs are useful to trace a request through different services and to identify -problems with requests. +problems with requests. Expected minimum log levels for Production: Information ### General logging guidance - Health check logging is not essential and can cause log saturation. Prefer -using health check endpoints that can be polled. + using health check endpoints that can be polled. - Consider the value of logs – avoid logging information that provides no -diagnostic or audit value. 
+ diagnostic or audit value, and include information that makes this request + different from other requests. - Consider the GDPR - avoid logging customer information. -- Avoid logging binary data or large files. Log references to objects stored in -appropriate storage instead. +- Avoid logging binary data or large files. Log references to objects stored in + appropriate storage instead. -- Be consistent with data that is being logged across a project. +- Be consistent, use the same language and format across all logs. - Teams should be selective about what is included in logs. Avoid logging -many different properties in the hope of capturing everything. + many different properties in the hope of capturing everything. - On-premise services need to have known mitigations for failures that may -occur upon trying to ingest logs (e.g., log to EventViewer, send an email -notification to supporting team). + occur upon trying to ingest logs (e.g., log to EventViewer, send an email + notification to supporting team). ## Security @@ -85,7 +86,7 @@ When implementing logging, consider the following secure design practices: ## Retention -Log retention is manged by the Observability team and is set to: +Log retention is managed by the Observability team and is set to: - **Non-live environments**: 7 days - **Live environments**: 90 days @@ -100,7 +101,7 @@ practices: **Definition of Ready** should include: - A common dictionary for log messages ensuring that values are comparable -across services within a project. + across services within a project. - A consideration of the different log types and how to develop towards that @@ -122,22 +123,21 @@ should plan to migrate. ### Testing requirements - **Test Approach and TSR documents**: Teams must demonstrate observability for -the service and prove this is working as expected. + the service and prove this is working as expected. 
- **Support team handovers**: Support/CI teams will ensure that good logging -practice has been adhered to, and this should be demonstrated. + practice has been adhered to, and this should be demonstrated. - **Smoke test/monitor log ingestion**: Teams should ensure logs have ingested -successfully and are discoverable. This could be via an automated test or a -manual check. + successfully and are discoverable. This could be via an automated test or a + manual check. - **Unit tests**: Logging is a first class citizen. Unit tests should assert -that logs are logging to the expected level. + that logs are logging to the expected level. - **Load testing**: Load tests should use Production level logging, so the -capacity generated from logs targeting Production is understood. The load -testing environment should be as live-like as possible. - + capacity generated from logs targeting Production is understood. The load + testing environment should be as live-like as possible. *** @@ -149,13 +149,15 @@ using ElasticSearch and Elastic Cloud. ## Technology Choice Elastic is the business's logging platform. There can be good reasons to select -other logging platforms, however approval from the Observability team and +other logging platforms, however approval from the Observability team and Architectural Practice Forum is required in these cases. -## Index naming convention +## Elastic Index Naming Convention Logs should be indexed in ElasticSearch according to the following convention: -**FullServiceName-environment-category** +**FullServiceName-environment-category**. The Observability team automation +creates indexes, so teams should not need to create them manually. Teams +with legacy indexes should be migrated to the new naming convention. 
### Diagnostic logs @@ -177,10 +179,10 @@ Should ingest into a dedicated ElasticSearch index named as e.g., Data logged to Elastic Cloud follows this retention implementation: - Logs are ingested into the "Hot" tier for best indexing and search -performance. + performance. - After **2 days** logs are automatically moved to the "Cold" tier, optimal for -data that is still likely to be searched but infrequently updated. + data that is still likely to be searched but infrequently updated. - After **7 days**, logs in **non-live** are deleted and cannot be recovered. From f8c4ec898858f572707e1261e1fafdedbbd9e62e Mon Sep 17 00:00:00 2001 From: Osric Wilkinson Date: Mon, 20 Oct 2025 08:39:39 +0100 Subject: [PATCH 9/9] Actually splits the logging policy into two files. --- .../GuidanceForLoggingToElasticCloud.md | 99 +++++++++++++++++ .../Logging/LoggingPolicy.md | 103 +----------------- 2 files changed, 102 insertions(+), 100 deletions(-) create mode 100644 software-engineering-policies/Logging/GuidanceForLoggingToElasticCloud.md diff --git a/software-engineering-policies/Logging/GuidanceForLoggingToElasticCloud.md b/software-engineering-policies/Logging/GuidanceForLoggingToElasticCloud.md new file mode 100644 index 00000000..b223acdb --- /dev/null +++ b/software-engineering-policies/Logging/GuidanceForLoggingToElasticCloud.md @@ -0,0 +1,99 @@ +# Guidance for Logging to Elastic Cloud + +To be read in conjunction with the [Logging Policy](LoggingPolicy.md). + +This section provides technical guidance for implementing the Logging Policy +using ElasticSearch and Elastic Cloud. + +## Technology Choice + +Elastic is the business's logging platform. There can be good reasons to select +other logging platforms, however approval from the Observability team and +Architectural Practice Forum is required in these cases. + +## Elastic Index Naming Convention + +Logs should be indexed in ElasticSearch according to the following convention: +**FullServiceName-environment-category**. 
The Observability team automation +creates indexes, so teams should not need to create them manually. Teams +with legacy indexes should be migrated to the new naming convention. + +### Diagnostic logs + +Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-Diagnostic` + +### Audit/Metric logs + +Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-Audit` + +### Request/response logging + +Should ingest into a dedicated ElasticSearch index named as e.g., +`SalesCatalogueService-Dev1-HTTP` + +## Retention implementation in Elastic Cloud + +Data logged to Elastic Cloud follows this retention implementation: + +- Logs are ingested into the "Hot" tier for best indexing and search + performance. + +- After **2 days** logs are automatically moved to the "Cold" tier, optimal for + data that is still likely to be searched but infrequently updated. + +- After **7 days**, logs in **non-live** are deleted and cannot be recovered. + +- After **90 days**, logs in **live** are deleted and cannot be recovered. + +## Log ingestion patterns + +### Cloud services - Elastic Cloud + +Services held in Azure log to an Azure Event Hub. In Elastic Cloud, a managed +Elastic Agent policy has an integration that pulls from all Event Hubs, using a +dedicated storage account container to track the processing of logs. + +An automated process discovers new Event Hubs, adds them to the Elastic Agent +policy, and sets up necessary indexes and index lifecycle management. + +#### Cloud native logs + +Cloud resource specific logging, such as native activity or diagnostic, can be +used by teams where beneficial. These logs aren't usually ingested in Elastic, +and teams need to keep a close eye on the costs associated with using them. + +### On-premise services + +On-premise services should also log to an Azure Event Hub as a step towards +becoming cloud services. 
The +[UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) package +provides support for logging to Event Hubs. + +## Legacy patterns and migration + +### Legacy - LogShipper and on-premise ElasticSearch + +Using LogShipper to ingest logs from on-premise services to on-premise +ElasticSearch has been deprecated and should no longer be used in new code. +Existing projects using this pattern should look to migrate to Elastic Cloud as +soon as possible. + +### Legacy - UKHO.EventHubLogging provider + +The UKHO.EventHubLogging provider has been superseded by UKHO.Logging.Serilog. +Teams using the legacy provider should plan their migration path. + +### Migrating services to Elastic Cloud + +Existing services currently using the legacy Azure Event Hub > LogStash > +on-premise ElasticSearch pattern should have no errors (and no warnings) in +LogStash before they are migrated to Elastic Cloud. + +If teams are using the legacy Azure Event Hub > LogStash > on-premise +ElasticSearch pattern for ingesting logs, they must check LogStash for errors +at every stage of the development process, using the DDC Grafana monitor set up +for this purpose. + +Services should adhere to the logging policy before migration. diff --git a/software-engineering-policies/Logging/LoggingPolicy.md b/software-engineering-policies/Logging/LoggingPolicy.md index 6d2f5bea..6cc707dc 100644 --- a/software-engineering-policies/Logging/LoggingPolicy.md +++ b/software-engineering-policies/Logging/LoggingPolicy.md @@ -3,6 +3,9 @@ This policy defines the logging requirements for services developed and supported by software engineering teams at UKHO. +UKHO uses Elastic Cloud for log storage, aggregation, and analysis. For +more detail about the use of Elastic for logging, see [the specific guidance](GuidanceForLoggingToElasticCloud.md). + ## Contents of Logs Logs must contain the following information: @@ -138,103 +141,3 @@ should plan to migrate. 
- **Load testing**: Load tests should use Production level logging, so the capacity generated from logs targeting Production is understood. The load testing environment should be as live-like as possible. - -*** - -# Guidance for Elastic - -This section provides technical guidance for implementing the Logging Policy -using ElasticSearch and Elastic Cloud. - -## Technology Choice - -Elastic is the business's logging platform. There can be good reasons to select -other logging platforms, however approval from the Observability team and -Architectural Practice Forum is required in these cases. - -## Elastic Index Naming Convention - -Logs should be indexed in ElasticSearch according to the following convention: -**FullServiceName-environment-category**. The Observability team automation -creates indexes, so teams should not need to create them manually. Teams -with legacy indexes should be migrated to the new naming convention. - -### Diagnostic logs - -Should ingest into a dedicated ElasticSearch index named as e.g., -`SalesCatalogueService-Dev1-Diagnostic` - -### Audit/Metric logs - -Should ingest into a dedicated ElasticSearch index named as e.g., -`SalesCatalogueService-Dev1-Audit` - -### Request/response logging - -Should ingest into a dedicated ElasticSearch index named as e.g., -`SalesCatalogueService-Dev1-HTTP` - -## Retention implementation in Elastic Cloud - -Data logged to Elastic Cloud follows this retention implementation: - -- Logs are ingested into the "Hot" tier for best indexing and search - performance. - -- After **2 days** logs are automatically moved to the "Cold" tier, optimal for - data that is still likely to be searched but infrequently updated. - -- After **7 days**, logs in **non-live** are deleted and cannot be recovered. - -- After **90 days**, logs in **live** are deleted and cannot be recovered. - -## Log ingestion patterns - -### Cloud services - Elastic Cloud - -Services held in Azure log to an Azure Event Hub. 
In Elastic Cloud, a managed -Elastic Agent policy has an integration that pulls from all Event Hubs, using a -dedicated storage account container to track the processing of logs. - -An automated process discovers new Event Hubs, adds them to the Elastic Agent -policy, and sets up necessary indexes and index lifecycle management. - -#### Cloud native logs - -Cloud resource specific logging, such as native activity or diagnostic, can be -used by teams where beneficial. These logs aren't usually ingested in Elastic, -and teams need to keep a close eye on the costs associated with using them. - -### On-premise services - -On-premise services should also log to an Azure Event Hub as a step towards -becoming cloud services. The -[UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) package -provides support for logging to Event Hubs. - -## Legacy patterns and migration - -### Legacy - LogShipper and on-premise ElasticSearch - -Using LogShipper to ingest logs from on-premise services to on-premise -ElasticSearch has been deprecated and should no longer be used in new code. -Existing projects using this pattern should look to migrate to Elastic Cloud as -soon as possible. - -### Legacy - UKHO.EventHubLogging provider - -The UKHO.EventHubLogging provider has been superseded by UKHO.Logging.Serilog. -Teams using the legacy provider should plan their migration path. - -### Migrating services to Elastic Cloud - -Existing services currently using the legacy Azure Event Hub > LogStash > -on-premise ElasticSearch pattern should have no errors (and no warnings) in -LogStash before they are migrated to Elastic Cloud. - -If teams are using the legacy Azure Event Hub > LogStash > on-premise -ElasticSearch pattern for ingesting logs, they must check LogStash for errors -at every stage of the development process, using the DDC Grafana monitor set up -for this purpose. - -Services should adhere to the logging policy before migration.
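
The required log fields and the index naming convention described in this series can be illustrated with a minimal sketch. It assumes plain Python with only the standard library; it is not the UKHO.Logging.Serilog API, and the names `build_log_entry` and `index_name` are hypothetical helpers introduced here for illustration. The field names and allowed values are taken from the policy text; everything else is an assumption.

```python
import json

# Allowed values as listed in the policy text of this patch series.
ENVIRONMENTS = {"Development", "Test", "PreProduction", "Production"}
LEVELS = {"Information", "Warning", "Error", "Fatal"}


def build_log_entry(project_name, service_name, environment, level,
                    message, trace_id=None):
    """Return one log entry as a JSON string with the required fields.

    Hypothetical helper for illustration only.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")

    entry = {
        "projectName": project_name,
        "serviceName": service_name,
        "environment": environment,
        "level": level,
        "message": message,
    }
    # traceId is only included when one is available.
    if trace_id is not None:
        entry["traceId"] = trace_id

    # json.dumps escapes newlines and quotes inside values, so text supplied
    # by a caller cannot start a forged log line. This is the structured
    # logging alternative to building log lines by string concatenation.
    return json.dumps(entry)


def index_name(full_service_name, environment, category):
    """Compose an index name as FullServiceName-environment-category."""
    return f"{full_service_name}-{environment}-{category}"
```

With these helpers, `index_name("SalesCatalogueService", "Dev1", "Diagnostic")` returns `SalesCatalogueService-Dev1-Diagnostic`, matching the example in the guidance. Validating `environment` and `level` up front is one possible way to keep values comparable across services, as the Definition of Ready asks; a real service would more likely rely on the enrichers of its logging library.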