About the Company

The organization is a large multinational consumer electronics manufacturer.

The Challenge

The technology division of the marketing business unit was responsible for monitoring around 1,500 URLs used by various apps for marketing purposes. A Jenkins server ran a job every 15 minutes to check the applications' status. However, the system generated excessive false positives, wasting the team's time and effort.

Many of the URLs that needed monitoring were generated dynamically, which compounded the problem. The company’s site reliability engineering (SRE) team had to add every auto-generated URL to the Jenkins job by hand, and this manual process was itself error-prone.

In addition, the business unit needed a centralized dashboard that would give the team a single place to get an accurate picture of application availability.

The tools used by the organization’s SRE team include Kubernetes, Jenkins, Amazon Web Services (AWS), Vault, Consul, vCenter, NGINX Plus, NetScaler, FreeIPA, Argo CD, Spinnaker, GitHub, Jira, Confluence, and Artifactory.

In brief, the organization was looking for the following:

  • Accurate data on application status: The current system generated far too many false positives, making its existing status alerts virtually meaningless. The company needed a system to validate alerts and provide accurate status data.
  • Immediate integration of auto-generated URLs: The company sought a solution that would capture auto-generated URLs as they were created and automatically add them to the Jenkins job.
  • Comprehensive centralized system health dashboard: The company urgently needed a single place where stakeholders could view accurate, up-to-date, and highly visible application URL status.

The Solution

After extensive consultation with relevant stakeholders, our DevOps team decided to employ a comprehensive health analytics dashboard. This platform consists of several microservices developed by our engineers that can gather metrics, review Splunk logs, and produce analytics on application environment user access.

Our engineers developed a set of custom Prometheus exporters to adapt the health analytics dashboard to the organization’s systems. These exporters monitor application health and supply the information to Prometheus. Our team also developed a loader and other microservices that pull data from Prometheus and push it to a PostgreSQL database.
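To illustrate the exporter idea, here is a minimal sketch of a URL health check rendered in the Prometheus text exposition format. It is not the exporter built for this project: the metric name `url_up`, the `url` label, and the helper functions are assumptions made for the example, and it uses only the Python standard library rather than a Prometheus client package.

```python
from typing import Callable, Iterable
from urllib.error import URLError
from urllib.request import urlopen


def http_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully (hypothetical health rule)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError, ValueError):
        # Covers HTTP errors, network failures, timeouts, and malformed URLs.
        return False


def escape_label(value: str) -> str:
    """Escape a label value as the text exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(
    urls: Iterable[str], check: Callable[[str], bool] = http_check
) -> str:
    """Build one gauge sample per URL: 1 when healthy, 0 otherwise."""
    lines = [
        "# HELP url_up Whether the monitored URL responded successfully.",
        "# TYPE url_up gauge",
    ]
    for url in urls:
        value = 1 if check(url) else 0
        lines.append(f'url_up{{url="{escape_label(url)}"}} {value}')
    return "\n".join(lines) + "\n"
```

A small HTTP handler that returns `render_metrics(...)` on a `/metrics` path would give Prometheus something to scrape. The `check` parameter is injectable so the rendering can be exercised without network access.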

We then implemented analytics that provide the product tracking the organization's internal teams require in each environment. The health analytics dashboard was integrated with Splunk and a custom alert script, allowing it to push accurate and timely notifications to the teams via email, Jira, Slack, and PagerDuty.

Key features of the solution include:

  • Proven health dashboard system: Our team integrated a well-proven set of microservices that form the health dashboard. This approach saved the company time and money as our system has already been extensively tested.
  • Quality analytics data: The new dashboard eliminates duplication and filters out bad data, resulting in high-quality analytics. This significantly reduced the number of false positives and enabled company engineers to respond to real issues quickly.
  • Automation and scalability: The new system can automatically add newly generated URLs to the Jenkins job without manual intervention. This means the system easily scales to monitor numerous URLs and applications.
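The case study does not describe how duplicates and false positives are filtered, so the following is only a sketch of one common approach: deduplicate URLs as they are registered, and raise an alert only after several consecutive failed checks. The `UrlMonitor` class, its method names, and the threshold of three failures are all hypothetical.

```python
from typing import Dict, Iterable, List


class UrlMonitor:
    """Tracks monitored URLs and their current failure streaks (hypothetical)."""

    def __init__(self, failures_before_alert: int = 3) -> None:
        self.failures_before_alert = failures_before_alert
        self._failures: Dict[str, int] = {}

    @staticmethod
    def normalize(url: str) -> str:
        # Simplifying assumption: URLs that differ only by surrounding
        # whitespace or a trailing slash are treated as the same URL.
        return url.strip().rstrip("/")

    def sync_urls(self, discovered: Iterable[str]) -> List[str]:
        """Register newly discovered URLs and return only the ones added."""
        added = []
        for raw in discovered:
            url = self.normalize(raw)
            if url and url not in self._failures:
                self._failures[url] = 0
                added.append(url)
        return added

    def record(self, url: str, ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        url = self.normalize(url)
        if ok:
            self._failures[url] = 0
            return False
        self._failures[url] = self._failures.get(url, 0) + 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self._failures[url] == self.failures_before_alert
```

Requiring a streak of failures trades a slightly later alert for fewer alerts caused by a single transient error, and firing only when the threshold is first reached avoids repeated notifications for the same outage.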

Business Impact

Key benefits the company enjoyed after project completion:

  • Fast access to health data: Our solution implemented a centralized one-stop-shop monitoring solution that provides the company’s SRE team immediate access to the system’s health data. The SRE team no longer wastes precious time searching for the right tool.
  • Enhanced visibility: The new dashboard provides high-quality analytics data on the company’s applications, tools, and infrastructure status.
  • Proactive response: The instant alert system substantially reduced the time between an issue being detected and its resolution. With newly generated URLs automatically added to the to-be-monitored list, problems are now spotted almost immediately.
  • Increased productivity: The company’s SRE team enjoys enhanced productivity as tedious monitoring tasks are handled automatically, allowing the team to focus on more critical issues.
  • Enhanced user experience: The new system includes many new features that take the mystery out of system health checks.

Technologies Used

Prometheus: An open-source system that supports a multidimensional data model and turns metrics into actionable insights
Splunk: A horizontal technology used for application management, security, and compliance, as well as business and web analytics
Kubernetes: An open-source container orchestration tool for automating computer application deployment, scaling, and management
ReactJS/NodeJS: JavaScript technologies used together to build single-page applications, with React running in the browser and Node.js on the server
Python: A popular general-purpose scripting language
PostgreSQL: A highly performant open-source database

Related Capabilities

Utilize Actionable Insights from Multiple Data Hubs to Gain More Customers and Boost Sales

Unlock the power of the data insights buried deep within the diverse systems across your organization. We empower businesses to effectively collect, visualize, analyze, and interpret data to support organizational goals. Our team helps ensure good returns on big data technology investments through effective use of the latest data and analytics tools.
