Getting Started with NetDevOps

Date: Dec 24, 2023 By Ivo Pinto and Faisal Chaudhry. Sample Chapter is provided courtesy of Cisco Press.

In this sample chapter from Automating and Orchestrating Networks with NetDevOps, you will explore main use cases for NetDevOps applications compared with traditional methods. You will gain insights into the decision-making processes, tooling choices, and required skills; and navigate common challenges and lessons learned.

In this chapter, we describe the main use cases where NetDevOps excels. We go over what the use case is, how it is handled in a traditional networking fashion and finish with the NetDevOps approach and its benefits. We also describe common decisions in the adoption of NetDevOps such as tooling choices, skills required, and possible starting points. This chapter finishes with lessons learned from many NetDevOps adoption processes: common challenges and mitigations. In summary, you will learn the following about NetDevOps:

What does it solve
How does it solve it
Possible starting points
Decisions and investments
Common pitfalls and recommendations

Use Cases

In Chapter 1, “Why Do We Need NetDevOps?”, you learned what NetDevOps is, its benefits, and its components. In this section, you will learn what specific use cases you can benefit from by applying NetDevOps practices and tools.

More specifically, this section goes into detail on each individual use case NetDevOps can help you with. Although this is an extensive list, as you can see in Figure 2-1, it is possible that not all use cases are represented here.

FIGURE 2.1 NetDevOps Use Cases Mind Map

The use case deep dives focus on the stages a typical continuous integration/continuous delivery/deployment (CI/CD) pipeline should have rather than on the automation scripts or infrastructure as code (IaC) that performs the actual actions. The reason for this choice is that we consider network automation a well-documented topic that is highly dependent on tool choices and desired automated actions, and we want you to learn the orchestration process behind each use case while keeping this chapter’s stages and practices mostly tool and technology agnostic.

These use cases are focused on the usage perspective, meaning that you can and should have CI/CD pipelines to merge developed code into your source control. However, this is not our focus. The following pipelines are the pipelines you, someone from your team, or an automatic trigger would start when you need to perform an action in your network. Note the pipelines that are triggered when you modify your automation code.

Triggers are a topic we will dive into in Chapter 3, “How to Implement CI/CD Pipelines with Jenkins.” However, for this chapter, it is important to understand that pipelines can be triggered in many different ways: manually by a user, automatically by a change in a code repository, automatically in response to an event (for example, a high % of CPU), and so on. Throughout this chapter, you will see mentions of possible triggers for each use case.

Provisioning

Provisioning, which is often confused with configuring, is the process of setting up the IT infrastructure. Infrastructure in this context can mean many things: virtual machines, containers, virtual network devices, services, and so on. However, when you want to create a virtual local area network (VLAN) and consult the documentation, you will often see “Configuring a VLAN.” That’s because configuring comes after provisioning; when you configure a VLAN, the switch was already provisioned.

In a physical environment, you provision a switch (cabling, racking, and stacking) and then configure it (VLANs, IP addresses, interfaces, and so on). In a virtual environment, you provision a virtual machine (number of CPUs, amount of RAM, or amount of storage) and then configure it (installing a software package, patching a security vulnerability, and so on) or you provision a virtual switch (virtual-to-physical port mappings) in a hypervisor environment and then configure port groups (VM-to-VLAN mappings).

In this book, provisioning refers to creating resources in networking environments. Mostly this happens in cloud environments—not necessarily only in public cloud environments, such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure, which are the more famous options, but in any cloud environment, such as the less popular private clouds like Red Hat OpenStack and VMware environments.

These provisioning actions can take effect in different forms, such as command-line interfaces (CLIs), graphical user interfaces (GUIs), and application programming interfaces (APIs). In our NetDevOps context, and because automation is one of our core components, you will more often see an API as the default choice.

A traditional networking provisioning workflow is usually orchestrated as a series of manual steps executed by one or more individuals. This workflow is achieved either manually, by clicking through the GUI for simpler actions, or manually, by executing a series of automation scripts that make use of the product’s API or CLI.

So, what does a typical provisioning NetDevOps pipeline look like and how does it differ from a traditional one? Figure 2-2 shows an example. This pipeline starts by retrieving the required code from a code repository, which is a version control system (VCS). In this repository, you will likely find the automation code to perform the provisioning action, which can be implemented by a variety of different tools (for example, Ansible, Terraform, Python, Unix shell, and so on), but also some specific variables required for this code to run. In concrete terms, this could be a generic Ansible playbook to replace the running-config on an IOS-XE switch and a variable file with the specific running-config to push.

FIGURE 2.2 Provisioning Pipeline

For some automation workflows, there will be no variables file, either because it is a static workflow without any possible variation (for example, a script that saves the running-config to the startup-config in all your network switches) or because the pipeline prompts the user for the required variables at the time of execution. We call these “runtime variables” because they are provided at runtime.

The second stage, labeled “Security Checks,” is a security verification stage. This is often ignored, not because people do not like security, but because it is, in many cases, hard to implement. In this stage, you verify that the combination of automation plus variables does not violate any security policy in a static way (“static” because you do not execute this automation to make this verification). Some tools can make this verification easier (for example, tfsec for Terraform). A common verification for cloud environments is to make sure the firewall rules are not open to the world.

The third stage, “Deploy to Dev,” is also a stage often missing from many production implementations. This is where you execute the automation code against a test/development/staging environment. This stage is meant to be executed against an environment that mirrors your production environment to give you confidence before you possibly make environment-breaking changes. Because of costs, historical reasons, or other factors, as we discuss later in this chapter, this stage is often skipped.

The pipeline’s development deployment stage is followed by a more thorough testing than static analysis. In this stage, you can go as deep as you want with your testing. From the simple “are the resources created with my parameters?” test to actual functional testing of the created resources—anything is possible. This serves as a decision point of either to continue and provision the same resources in your production environment or to stop (in the case of discovered unexpected behaviors).

Assuming you did not abort, it is time for the production deployment stage. In this stage, you execute the same code as you did in the development environment but with a different target—production. This should have the exact same effect, and hopefully you now understand the value of the previous stage and the added assurance you get from having a testing environment.

Although you tested in the staging environment, the following stage involves testing in production. Here, oftentimes you do even a more thorough testing, and you include some end-to-end tests that you could not run in your testing environment. It is common that testing environments are smaller in size and complexity, and because of that some tests can only be run in production. Likewise, load tests (meaning putting a resource under simulated demand) are typically run in this stage.

The last stage is the clean-up stage, and it is optional. In this stage, you often clean up your testing environment (in the case that is virtual). However, if you have a static testing environment that mimics your production environment, you would not delete it in this stage.

Using Terraform to provision a tenant in Cisco Application Centric Infrastructure (ACI) is a concrete implementation of this use case. First, as shown in Example 2-1, you need to have Terraform code to implement a Cisco ACI tenant. With this code in a code repository, you can create a CI/CD pipeline with the previously mentioned stages.

Example 2-1 Source Code for aci_tenant.yml

 resource "aci_tenant" "vmotion_tenant" {
    name        = "vMotion"
    description = "Tenant for vMotion"
}

In the first stage, you check out the code from the code repository. In the second stage, you can use a script to verify whether the tenant’s name complies with your naming standards (for example, whether it uses CamelCase), or you can use tfsec to verify that the security configurations of the tenant are compliant with best practices.

In the third stage, you execute your Terraform code against a test/development/staging Application Policy Infrastructure Controller (APIC) using the Terraform syntax:

$ terraform apply aci_tenant.yml

In the fourth stage, you verify that the new tenant was successfully created and nothing broke. Your ACI fabric should still be working as expected. If that is the case, your fifth stage should apply your Terraform code to your production APIC. You can use environment variables to choose what environment to deploy to in each stage, or you can use other techniques, as you will see in Chapter 6, “How to Build Your Own NetDevOps Architecture.”

In the sixth stage, you verify your production environment after provisioning your new tenant.

Normally, APICs are long-lasting appliances and are not spun up and spun down; instead, they run continuously. Because of that, you would not have the last optional stage to delete your test environment.

Note that error handling was not mentioned in this example of a typical CI/CD provisioning pipeline. If any of the stages fail, the pipeline would stop and human intervention would be required to set it right. There are different ways to handle errors; for example, you could retry the same action or you could roll back. In networking, retrying the same action is typically not advised without first checking the error logs. Because of that, rollbacks are the most common way of handling provisioning errors. Depending on the automation tool used, rollbacks for the most part only remove what was provisioned and therefore are easy to achieve. In Chapter 4, “How to Implement NetDevOps Pipelines with Jenkins,” you will learn how to implement rollbacks in code.

Lastly, provisioning workflows are typically triggered manually by a user or by changes in the configuration variables in a code repository.

Configuration

Configuration, as mentioned, comes after resources are provisioned. This is the most common use case in NetDevOps. As the name implies, this use case consists of configuring resources, and both changes to an already existing configuration or new configurations fall within this use case’s umbrella. Specific examples include configuring a VLAN on a router, configuring a virtual machine with a software package, and configuring a new virtual switch port group in a VMware environment.

In order to configure your target resources, besides the desired configurations, you will need a configuration tool. Many different tools with different features are available, including Ansible, Terraform, Chef, vendor-specific tools, and even programming languages such as Python. In our context of networking, the predominant tool is Ansible. However, choosing a tool is complex and a topic described later in this chapter.

A traditional configuration workflow can be described as a series of steps executed in sequence:

Step 1. Connect to a resource.

Step 2. Verify the current functionality.

Step 3. Optionally retrieve the current configuration and functionality.

Step 4. Optionally compare the current configuration to the desired configuration.

Step 5. Configure the resource with the new configuration.

Step 6. Verify the desired functionality.

These steps are usually manually executed by an operator from a workstation or executed by an operator in an automated fashion. For example, in a networking setup, an operator can either SSH to a device and issue a show command from their workstation (steps 1 and 2) or run a script from their workstation that does the same two steps in an automated fashion. The fully manual approach is more error-prone.

In a NetDevOps fashion, you take the previous automated approach a couple steps further. Figure 2-3 presents a complete NetDevOps configuration pipeline.

FIGURE 2.3 Configuration Pipeline

This pipeline is similar to a provisioning pipeline, but this section focuses on the differences.

One interesting difference is due to the fact that configuration workflows are often configuration changes made to previous configurations. Unlike in the provisioning use case, where you create net new assets, in this scenario you need to take into consideration what was configured before. In the third stage of the pipeline, you do that by retrieving the current resource’s configuration. In this same stage, you can also retrieve information about the current functionality. However, many organizations prefer to separate these stages into two different ones, as shown in Figure 2-3 as the “Retrieve Metrics” stage; this allows them to manage the scripts to gather information independently of each other and achieve a higher level of decoupling.

Imagine you want to configure a new Open Shortest Path First (OSPF) neighbor in a Nexus switch. In the third stage, “Retrieve Config,” you retrieve the running-config section of OSPF and the configuration of the interface where the new neighbor exists. In the fourth stage, you retrieve the show ip ospf neighbors and the show ip route commands’ output.

The information gathered at this stage can be used to derive what configuration changes are needed. For example, if you want to replace a Simple Network Management Protocol (SNMP) key, it is different from configuring a new SNMP key on top of another existing one. This would result in having two SNMP keys instead of a single changed one. This type of logic needs to be implemented by you and is highly dependent on the configuration tool used.

In the fifth stage, “Deploy to Dev,” you configure the resource in the test/development/staging environment with the new configuration.

The next stage verifies that these configurations were applied correctly. You can, again, have a single stage that verifies both configuration and functionality, or you can use two separate stages. Figure 2-3 shows a single stage, “Testing,” but it’s recommended to use two.

Continuing with the previous example of configuring a new OSPF neighbor in a Nexus switch, in this stage, you would again retrieve the Nexus’s running-config section of OSPF and the configuration of the changed interface and make sure it has the newly configured parameters. On top of that, you would retrieve the same show commands and verify the command output shows the new neighbor along with new routes. In this stage, like in the provisioning use case, you can take your testing further and also test functionality (for example, a connectivity test to an endpoint behind the new OSPF neighbor). The depth of the testing should depend on the criticality of the change and your willingness to accept risk.

So far, all the previous stages were executed against a development environment. This is a major difference from the traditional workflow, where there’s a single environment—production. If everything functions as expected at the end of the verifications, you repeat the same stages but with a production resource as the target, represented by the “Deploy to Prod” stage.

Configuration workflows, just like the provisioning workflows, are typically triggered manually or by variable changes in a source control repository.

Data Collection

The simplest use case is data collection, where the goal is to gather data from resources. Although the target resources can differ (virtual machines, physical network equipment, controllers, and so on), the pipeline architecture is the same. Data collection is more often triggered on a schedule, like a cron job, than performed manually by an operator. However, these two are the usual triggers.

In a traditional scenario, operators can use different ways to retrieve data. In networking, a common technique is to connect to devices using SSH and retrieve show command outputs. Other data-gathering techniques include SNMP polling and using a device’s APIs.

It is important to note that all of these techniques are being increasingly automated. Instead of operators manually connecting to every device and gathering the command outputs, they now run scripts from their workstations that do this for them. However, NetDevOps takes this further by adding consistency, history tracking, and easy integrations to the process. Figure 2-4 presents a data collection pipeline.

FIGURE 2.4 Data Collection Pipeline

It starts with the usual stage of retrieving your source code from a code repository; in this case, the most commonly used code is automation scripts, namely using Python. Python’s network modules, vibrant developer community, and its easy-to-learn features, such as human-friendly syntax, make it the most used tool for this purpose. Nonetheless, Ansible and other tools can also effectively retrieve data.

In the second stage, “Gather Data,” the scripts run and the results are stored locally. This step can be time-consuming if there are several target resources or if the scripts retrieve a large amount of data from each resource. In the first case, you can speed up the process by having multiple data-gathering stages running in parallel. You will learn more about this technique in Chapter 4.

The third stage, “Process Data,” is optional but highly recommended. Data gathered from devices comes in its raw form, meaning it comes in whatever format the device outputs it in. Most of the time, this format is not the one you need. In this stage, you can use a programming language to parse the collected data into useful insights.

For example, if you issue a show cdp neighbors command, you get something similar to Example 2-2. This output is very verbose, and you might only need the name and the remote and local interfaces of the device’s neighbor to store in a database. This is where a script could parse the output and save it in a simplified format.

Example 2-2 Output for show cdp neighbors* on a Cisco Switch*

 Switch#show cdp neighbors
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
                  D - Remote, C - CVTA, M - Two-port Mac Relay
Device ID         Local Intrfce     Holdtme    Capability  Platform  Port ID
Switch3.cisco.com
                  Gig 1/0/19        136             R S I  WS-C3850- Gig 1/0/23
Switch2.cisco.com
                  Gig 1/0/17        126             R S I  WS-C3850- Gig 1/0/24
Switch1.cisco.com
                  Gig 1/0/2         191             R S I  WS-C3650- Gig 1/0/24
Total cdp entries displayed : 3

The last stage, “Save Data,” is also optional. There are times when you are just collecting data for real-time verifications and you do not want to store what is collected; this is the equivalent of connecting to a network device by issuing a show command to verify something and immediately terminating that connection. However, if you do want to store what was gathered, you use this last stage. This stage is especially useful when you want to send the gathered data to multiple systems (for example, a monitoring system and a long-term archive). One goal for this last stage might be to anticipate network issues through the use of predictive machine learning (ML) models.

Compliance

There are many compliance frameworks and requirements. Most companies have to comply with regulatory requirements, such as HIPAA, PCI-DSS, SOX, DORA, and GDPR, on top of their own compliance policies.

In order to be compliant, companies have to prove their compliance. It is not enough to say they are applying the measures; in most cases, they have to prove the measures are actually in place. Historically, this was done through a series of manual steps executed by operators. Fast-forward to today, the process is a mix of manual and automated steps, mostly via manually operated automation tools. For example, if you must prove your organization is using SSH version 2 for all its network devices access instead of the older (and less secure) version 1, you must connect to every device in the network and retrieve and verify its configuration to show that.

As previously mentioned for the data collection use case, connecting to devices today is commonly achieved by an operator running a script on their machine rather than actually connecting and issuing a show command. However, for compliance, you want something more reliable for recordkeeping than a human operator. A NetDevOps pipeline, as shown in Figure 2-5, adds these benefits.

FIGURE 2.5 Compliance Verification Pipeline

The second stage, “Gather Data,” and third stage, “Verify Compliance,” can be implemented in a single stage if your automation script does both of those tasks together. Beyond just gathering data, you also need to parse it into a format you can use for the verifications. This should all be achieved by your automation scripts.

The fourth stage, “Apply Remediations,” is optional. However, it is a nice way of maintaining your environment’s compliance. It is hard to achieve in a general-purpose way; nonetheless, for common attributes that tend to fall out of compliance, you can develop a script to fix them. For example, in networking, you can forget to enable BPDU Guard on newly configured access ports. After the automation identifies these ports as noncompliant, your remediation script adds this configuration to the affected ports. Another example is forgetting to enable password encryption. This can be achieved by triggering a configuration pipeline.

Lastly, the “Generate Reports” stage aggregates all information into a report format. Compliance officers typically want a document with a set structure. This can be a single report or multiple reports. There are many ways to generate these reports, but the most common is to use a programing language to produce markdown documents that can also be tracked under version control systems. If you do not need to generate a specific report, the logs generated by the pipeline run itself can be enough to document compliance, and you could then delete this last stage.

CASE STUDY: PCI-DSS NETWORKING COMPLIANCE IN A BANK

Banks, because they are part of the payment industry, have to comply with the Payment Card Industry Data Security Standard (PCI-DSS), which is a set of security standards designed to ensure that all companies that accept, process, store, or transmit credit card information maintain a secure environment.

AnyBank, a large bank with a global presence in multiple countries across multiple continents, has a very large install base with different types of equipment from various vendors.

Because PCI-DSS standards are generic, AnyBank developed an internal document that details what kind of security controls are mandatory in all of its networking equipment. However, the bank faced a complicated challenge: each networking vendor implements its own command syntax, and unfortunately for AnyBank, it has multiple different vendors’ equipment in its production network. In order for compliance not to be left to the discretion of network operators, AnyBank developed its internal security controls document further, to the point that the document had a specific command for each platform expected for PCI-DSS controls.

To put this document to use, twice a year AnyBank hosted a compliance manual verification exercise, where each of its local branches verified its equipment configuration compliance against the document. This was an expensive and tedious exercise for everyone involved. Connecting to multiple devices and checking configuration databases against a set of practices is not fun. Additionally, after this exercise, the same engineers involved in the verifications needed to document their findings so that a central entity could generate a consolidated report.

AnyBank realized this process could be improved. It started by creating automation scripts, using Python, that pull the configuration from the devices in the network and match it against a set of compliance templates built from the initial document. The result was a simple status report stating whether a device is compliant and, if not, why.

These automation scripts greatly improved AnyBank’s ability to verify its compliance status; however, it still required operators to know how to run the scripts and had to point them to the right device types. Furthermore, AnyBank wanted this process to run continuously to achieve the best possible security posture.

Therefore, AnyBank evolved its solution to NetDevOps pipelines that use these previously created scripts to verify the bank’s compliance status on a scheduled basis. AnyBank developed multiple pipelines, each targeting a different geography, that can run separately. It also created a pipeline that aggregates all available compliance findings into a consolidated report.

On top of all of this, AnyBank exposed these pipelines to its network operators through a graphical user interface (GUI). With this solution, anyone, with or without automation knowledge, is able to assess the compliance status of the network at any time.

AnyBank had such success adopting NetDevOps practices for its compliance efforts that it further adopted these practices for other use cases. Currently, the bank does most of its provisioning, configuring, and data gathering in the same fashion.

Compliance verifications are typically only done during certain specific time periods (for example, when you are trying to achieve a compliance certification or at a certain time to comply with regulations). The automation capabilities discussed in this section allow for constant compliance verification, rather than the usual point-in-time verifications. Because these pipelines are automated and can run without effort, you are able to achieve a constant monitoring of your compliance status by triggering them on a scheduled basis. A benefit of this behavior is the ability to take faster remediation actions whenever noncompliant characteristics are detected. For point-in-time verifications, you can still trigger them in a manual fashion.

Monitoring and Alerting

Monitoring is observing something over time. In our context, it can be observing metrics, logs, or simply the configuration of our resources. This is not a new field; there are plenty of well-established monitoring techniques and tools out there.

Alerting is warning something or someone of a certain condition. In networking, this is typically associated with a numerical threshold or a specific error message. Monitoring and alerting are two sides of the same coin: without monitoring resources, you cannot accurately trigger alerts. Although you can monitor resources without creating alerts, you lose value. Imagine you are monitoring the CPU percentage of your network devices, and the monitoring solution identifies a specific router at 100% CPU usage. If you don’t have any alert on this condition, monitoring it does not add much value other than being able to know, at a later time, that the condition happened. On the other hand, if you do have an alert set, you can notify someone to act on this condition and mitigate the issue (likewise, you can notify something that triggers an automatic action, such as running a script).

As mentioned, monitoring and alerting are an established field. Because of that, you will most likely use a commercial off-the-shelf (COTS) tool. They are easy to install, configure, and use. These tools can be split into two categories: full end-to-end tools that manage the consumption, processing, and visualization of data, such as Cisco DNA Center Assurance and Datadog Network Monitoring, and tools that only manage processing and visualization of data, leaving you the responsibility of taking care of data ingestion, such as Splunk and the ELK (Elasticsearch, Logstash, Kibana) stack.

In the first use case, NetDevOps does not really have a role to play. However, in the second use case, where you own the data ingestion, a pipeline such as the one previously shown for data collection is useful. In the last stage you send the data, in the correct format, to the monitoring tools rather than a historical database. Another architecture sends the data to the database and configures the monitoring tool to monitor the data in the database, although this is less common. Alerting in these use cases is done in the monitoring tool itself, which is a widely supported functionally of these tools.

Besides the previously mentioned scenarios, there are two more cases: one where you decide to build your own monitoring and alerting solution, and one where you do not build a solution but use NetDevOps pipelines to achieve a simple alerting flow.

In the case where you decide to build your own tool, think again—this is not an easy task. In the case where you just want a simple alerting mechanism but don’t want to invest in any tool at all, you can use a pipeline like the one in Figure 2-6.

FIGURE 2.6 Monitoring and Alerting Pipeline

You can see the resemblance to the data collection pipeline, because monitoring is data collection over time. Therefore, you would trigger this pipeline on a scheduled basis.

At the end of the pipeline, you can see that when you send the processed data to storage, you also verify whether it passes a certain threshold or other configured alarm condition; if it does, you trigger an alarm. This does not need to be done in parallel, but doing it in parallel can trigger the alarm quicker.

Because the word “alarm” is used, you may think this needs to be a notification of some sort. However, this is not true. An alarm in this sense is just an action; it can be a notification such as an email or SMS, but it can also be an action such as calling an API or triggering another pipeline (for example, a remediation pipeline).

In a networking scenario, imagine you are monitoring the log file of a Layer 2 switch and you have configured an alert for the following error message format:

2022 Jul 14 16:04:23.881 N9K %L2FM-4-L2FM_MAC_MOVE2: Mac 0000.117d.
e03f in vlan 71 has moved between Po5 to Eth1/3

When your monitoring pipeline detects a MAC address moving between two different ports, which is sometimes a sign that an L2 loop is present on the network, it triggers a remediation pipeline that shuts down one of the two ports involved. On top of this, you could also notify someone that this action was taken, meaning there is no limit to the number of actions an alarm triggers.

Reporting

Creating reports is an activity many do not enjoy. However, they are needed for all sorts of reasons. Common reports in networking include hardware platforms, software versions, configuration best practices, and security vulnerabilities.

Each report has its own requirements and format. Drilling down on software version reports or software install base status reports typically requires three steps:

Step 1. Connect to a device and gather the software version.

Step 2. Verify the current version against recommended vendor version.

Step 3. Repeat Steps 1 and 2 for every device and then generate a report.

Like many of the previous tasks, most companies today have part of the process (at least Steps 1 and 2) automated, but they run these automated processes/scripts manually from their workstations. Nonetheless, for Step 2, the operator needs to obtain, beforehand, the recommended version for comparison. In an ideal scenario, the automation could fetch the recommended version at the time of verification. For this to be possible, this information needs to be available from the vendor on a website or in an API, which is typically the case.

It is worth noting the higher the number of different operators running the scripts (for example, operators in different geographies or branches), the higher the likelihood of human error.

Other report types, independent of their nature, share the same sequence of steps to generate:

Step 1. Gather data.

Step 2. Verify the data against rules.

Step 3. Generate a report.

A NetDevOps reporting pipeline architecture is presented in Figure 2-7.

FIGURE 2.7 Reporting Pipeline

This is similar to a compliance pipeline, but in this case, we do not want to apply remediations, at least not at this stage, because the goal is just visibility.

As previously mentioned, some reports require data from outside of your environment to be generated. In the software versions example, the required data is the recommended vendor version, and in the case of security vulnerabilities you need a matrix of vulnerabilities per software version. This information gathering can and should be implemented as a separate stage. You can implement it in parallel with the data gathering stage from your environment, as shown in Figure 2-8, or before that stage, right after checking out your code from the code repository, as shown in Figure 2-9.

FIGURE 2.8 Reporting Pipeline Integrated with a Third Party in Parallel

FIGURE 2.9 Reporting Pipeline Integrated with a Third Party Sequentially

Having it before instead of in parallel has the advantage of saving compute resources in the case that the needed information is not available. This means that your pipeline will fail without trying to gather information from your network because the verification data was not available.

Also, the previous examples show a capability offered by some proprietary network controllers such as Cisco DNA Center. A major advantage of this approach is reusability. You are able to create reports for anything you need using the same pipeline structure and automation scripts, with only minor modifications. You may have reporting needs beyond the typical ones; for example, you might have to report how many of your devices are running at or below 50% CPU usage or which of your devices are running out of physical memory. For these custom reporting needs, the proprietary controllers fall short.

Migrations

Network migration is a complex task that typically entails changing something to something different. Tasks can range from changing network configurations, provisioning new software, to replacing hardware. Nonetheless, it is important for you to understand that migration procedures are a combination of many smaller and simpler tasks.

Because of the criticality of some networks and the downtime some migrations cause, these are usually performed within maintenance windows. A maintenance window is a period of time, scheduled in advance, during which changes are made and service interruptions can happen. This concept is wider than networking or IT and is used in multiple other industries such as manufacturing and retail.

In networking, migrations are typically associated with a method of procedure (MoP) document. Although it can have various names, this document details the migration steps one by one. A network device hardware replacement migration, for example, typically consists of the following steps:

Step 1. Gather configuration data from the current device.

Step 2. Prepare configuration for the new device.

Step 3. Gather operational data from the current device.

Step 4. Configure the new device.

Step 5. Replace the current device with the newly configured device.

Step 6. Gather operational data from the new device.

Step 7. Verify operation data changes.

This is just an example; the actual steps may differ, depending on the migration use case. In a scenario where you do not have extra rack space, you have to switch the order of Steps 4 and 5. In a scenario where you are replacing the physical hardware but no configuration changes are required, Step 2 is not needed, as you can simply use the configuration gathered in Step 1. All of this is to show that each migration is different, so take this into account.

Independent of the migration scenario, most of the steps can be automated. However, the physical moving of equipment and cables cannot, or at least requires a different type of automation not covered by this book. Nonetheless, data collection, device configuration, and virtual device provisioning can be automated, as shown in previous sections. Many network migrations do not involve physical activities; therefore, those can be fully automated.

Today’s state-of-art migrations, as with many of the previous use cases, rely on automating these steps individually and rely on an operator to execute those automation steps. NetDevOps ties all the automation steps together while enhancing the overall experience.

Figure 2-10 shows a simple two-tier network topology composed of a single distribution switch and two access switches. The migration consists of replacing the distribution switch with a newer one.

FIGURE 2.10 Two-Tier Network Topology

To replace this distribution switch, we rack and stack a new distribution switch and connect it to the available ports on our access switches. On top of that, we configure this new switch with out-of-band (OOB) management to enable remote management access. An example pipeline for the migration is shown in Figure 2-11.

FIGURE 2.11 Migration Pipeline

This migration pipeline is divided into two pipelines: Dev and Prod. You can achieve the same functionality in a single pipeline, but the goal is to demonstrate the capability of a pipeline calling another pipeline.

You can see similarities between the migration pipeline and the configuration and data collection pipelines because this migration scenario is based on configuring a new device (configuration) using information from an already existing device (data collection).

In this scenario, you also see the return of the test/development/staging network. As mentioned, migrations tend to be critical, and using a test network to verify your changes before deploying them in production is a way to mitigate risk. In this particular example, you create a new test network to test the changes; this would be a virtual test network, as you will see in Chapter 5, “How to Implement Virtual Networks with EVE-NG.” There are other possibilities to test your changes. For example, if you already have a test network, you could modify the pipeline to configure your test network the same way as your production environment, replacing the stage “Create Test Network,” and then deploy and verify the changes there. In this case, the test network could be physical.

The provisioning use case briefly mentioned rollbacks. However, for migrations, rollbacks are a critical piece. As part of the MoP, network engineers prepare rollback configurations and actions in case the migration does not go as expected. This is fairly common. You see a rollback stage associated with a decision point in the pipeline in Figure 2-11; NetDevOps facilitates rollbacks in comparison to how they normally go. If you have been involved in migrations, you know that rolling back is almost always a high-pressure situation. If you are rolling back, things are already not going your way. On top of that, to roll back, you will need to make even more changes. When time is an important factor (for example, in very short service cut migrations), the pressure to roll back quickly is huge. This is a very error-prone activity. In the pipeline scenario, the rollback is prepared in advance without pressure. The rollback configurations can be tested beforehand, and you can be sure that the automation will not make any copy/paste errors.

CASE STUDY: MIGRATIONS AT AN ENERGY COMPANY

AnyProvider is an energy provider that powers millions of homes in AnyCountry. Like other energy providers, AnyProvider has a private network it uses to collect metrics and communicate across its many stations. On top of that, it also owns and manages several campuses across the country. For AnyProvider, networking is critical because an outage means loss of visibility of its infrastructure and potentially loss of business.

AnyProvider hired an external company, AnyCompany, to help migrate one of its campuses. This was a fairly large campus with four buildings, 12 floors per building, and around 4000 employees. This campus was powered by different types of networking equipment, among them around 800 switches, 2000 access points, and over 5000 other connected devices.

The migration consisted of replacing the current switch install base with a newer switch model while incurring the smallest possible downtime. To achieve this, AnyCompany needed to physically install the newer switches, configure them, and make the traffic switch while ensuring everything still worked as intended.

AnyProvider is a traditional company and hired AnyCompany to perform this migration in the traditional fashion, meaning network engineers onsite configuring the new switches with the required configurations and manually verifying that the same number of endpoints were active before and after the switch. The migration was divided into waves because of the size of the campus.

Once they started the migration, the planned method was going well, up until after the third wave of migrations. The network engineers onsite noticed that on one of the migrated floors some endpoints that were previously recorded to be working were no longer working. After a long night of troubleshooting, the engineers finally found the root cause. One of them had copy/pasted the wrong configurations in one of the switches on that floor. They fixed the error, and everything was working again. However, this mistake did not go unnoticed by management.

The management of AnyProvider was not happy with the extended service outage, and one of them suggested that they wanted AnyCompany to enhance their migration mechanisms to use automation and orchestration techniques so that this would not happen again.

AnyCompany involved extra engineers on the project to automate what could be automated. They started by creating Ansible playbooks to retrieve the current configurations from existing devices. Playbooks to create the configuration and then configure the new devices followed. By this time, manual errors were minimized because onsite engineers no longer copy/pasted configurations to and from Notepad. However, they were still facing long endpoint behavior-verification times. This was automated next; the engineers created Python scripts to retrieve “pre” and “post” information from the network devices so they could compare their status as well as created behavior verification scripts, such as Wi-Fi connection simulation and IP phone calls.

By the time all these automation scripts had been created and were being used, the onsite engineers were having a tough time running them in the correct sequence without losing the outputs created, which were critical for accountability.

The automation engineers took this migration a step even further with NetDevOps. They created migration pipelines that took care of the ordering of actions and maintained not only the outputs but also the accountability of who ran what pipeline and when.

These pipelines were not fully automated, and most of them had manual input stages asking for engineers’ confirmation because some of the migration steps required changing cables, which was not automated.

In the end, this team of engineers was migrating six floors per night instead of the planned one floor—a sixfold increase in speed while minimizing errors and service downtime.

For migration pipelines, the only trigger is manual. You could use a different type (for example, a scheduled trigger), but migrations are typically such high-risk activities that they require human supervision, even when they are being executed by NetDevOps pipelines.

Troubleshooting

Some engineers love troubleshooting—the feeling of chasing and solving an unexpected behavior—but many others hate it. Troubleshooting is both a skill and an art, many times fueled by the rush of needing to fix something quickly because the issue is affecting a production service. Indeed, troubleshooting is often required in the worst moments. Do you remember the last time you had to troubleshoot something? Was it a calm situation in a development environment, or was it a production outage?

Troubleshooting often requires deep technical knowledge that not everyone in the company has. For example, if you are responsible for a production service that has a pager associated with it, do you let a newcomer, even if they are an expert on the technology, take part in troubleshooting the service. At least not initially, until the newcomer is fully onboarded and knowledgeable about the intricacies of the service. This shows that troubleshooting not only requires technical knowledge but often also subject matter expertise in the specific service itself.

Independent of all these challenges, a troubleshooting workflow is quite well-defined and includes the following steps:

Step 1. Gather data from the affected resources.

Step 2. Analyze the collected data and formulate hypotheses.

Step 3. Experiment with the most likely hypothesis by configuring or reconfiguring resources.

Step 4. Test for success.

Step 5. Repeat Steps 3 and 4 until you’re successful.

Not in every scenario can you apply Steps 3 and 4 multiple times and experiment with several hypotheses. In some scenarios, you need to analyze the data until you are certain of the problem and the solution. Nonetheless, after you are certain of these, you still apply Step 3. It is notoriously difficult to be 100% certain, and even if you are, you still need to apply the solution and verify the success.

In a networking-specific scenario, the aforementioned troubleshooting workflow is very similar:

Step 1. Connect to the possible affected network devices.

Step 2. Collect show command outputs.

Step 3. Analyze the collected outputs and formulate hypotheses.

Step 3a. (Optional but common) Repeat Steps 2 and 3 until you arrive at reasonable hypotheses.

Step 4. Reconfigure/configure any identified missing feature in the network devices.

Step 5. Test for success.

Step 6. Repeat Steps 3 and 4 until you are successful.

Sounds pretty simple, right? Well, it is not. A network problem can manifest itself, and usually does, with pretty generic symptoms, such as loss of connectivity for some endpoints, increased latency, or users complaining their access is “slow.” From such a generic description it’s typically hard to pinpoint the specific network devices affected; therefore, your first step has a large number of target devices. On top of that, after connecting to your devices, assuming they are the correct ones, what show commands do you run? A common technique on Cisco devices is to start with the generic ones, such as show logging, as shown in Example 2-3, or show running-config. Other vendor devices have similar commands with different syntax.

Example 2-3 Output for show logging* on a Cisco Switch*

SWITCH1#show logging
Syslog logging: enabled (0 messages dropped, 3 messages rate-limited, 0 flushes, 0
  overruns, xml disabled, filtering disabled)
No Active Message Discriminator.
No Inactive Message Discriminator.
    Console logging: level debugging, 34 messages logged, xml disabled, filtering
disabled
    Monitor logging: level debugging, 0 messages logged, xml disabled, filtering
disabled
    Buffer logging:  level debugging, 34 messages logged, xml disabled, filtering
disabled
    Exception Logging: size (8192 bytes)
    Count and timestamp logging messages: disabled
    Persistent logging: disabled
No active filter modules.
    Trap logging: level informational, 38 message lines logged
        Logging Source-Interface:       VRF Name:
Log Buffer (8192 bytes):
*Mar  2 00:00:01.137: %VIRTIO-3-INIT_FAIL: Failed to initialize device, PCI
  0/6/0/1002 , device is disabled, not supported
*Mar  2 00:00:01.381: %ATA-6-DEV_FOUND: device 0x1F0
*Mar  2 00:00:08.485: %ATA-6-DEV_FOUND: device 0x171
*Mar  2 00:00:08.704: %NVRAM-5-CONFIG_NVRAM_READ_OK: NVRAM configuration 'flash:/
  nvram' was read from disk.
*Feb  12 08:51:58.706: %PA-3-PA_INIT_FAILED: Performance Agent failed to initialize
  (Missing Data License)
*Feb  12 08:52:05.064: %LINK-3-UPDOWN: Interface GigabitEthernet0/0, changed state
  to down
*Feb  12 08:52:05.068: %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state
  to up

Step 3 is highly correlated to Steps 1 and 2. You will formulate hypotheses based on your findings. And most of the time, you get stuck on Step 3a, bouncing between Steps 2 and 3 before you make any type of configuration change.

You make it to Step 4, however, and after you make your change, your users are still affected by their initial condition. Step 5 is a failure, but there is another thing you need to look out for: Did the change you make break something else? Maybe something else completely unrelated to the problem you were investigating? These are hard questions to answer in a traditional network setup, but you already have the solution: NetDevOps.

Troubleshooting is a collection of smaller use cases: data collection, configuration, provisioning, monitoring, and so on. What you have learned so far applies to this use case. Instead of manually connecting to devices and collecting show command outputs (Steps 1 and 2), you can run parameterized data collection pipelines that target the intended devices. Likewise, after formulating your hypothesis, you can codify those changes into a source control repository and run a configuration or a provisioning pipeline, depending on the troubleshooting scenario, and execute the changes with a higher degree of confidence. On top of that, the success criteria testing could be baked into the pipeline with an automatic rollback stage, as described previously.

When testing for your success criteria, be aware of the time it can take to propagate a change; likewise, when you are doing a manual change, you might issue a show command multiple times before it shows that the changed output using automation is the same. Another example is when you reboot a device: It takes time for the device to come up again and accept connections, and you often find yourself repeating the ssh command multiple times, or you have a ping running to the device and only reconnect when the ping is successful. In an orchestration scenario, take this into consideration. Add a wait time stage if you know the target device will not be immediately available. On top of that, you have a retry configurable option at the stage level in most CI/CD tools, which you can use in your verify success criteria stage.

All these changes combined, or even a subset of them, will not make you love troubleshooting if you hate it, but it can make troubleshooting easier, more reliable, and reduce mean time to repair (MTTR).

There are other applications of NetDevOps in troubleshooting. It can abstract the troubleshooting activity as a whole, disguised as a pipeline. This is not as simple to build as the previous example where you replaced the individual troubleshooting workflow steps with automated activities, but the reward is higher.

Figure 2-12 shows you a troubleshooting pipeline that is triggered automatically by a monitoring system when an alarm is triggered. The secret sauce of this pipeline lies in the step of automated machine reasoning. Automated reasoning is an area of computer science that is concerned with applying reasoning in the form of logic to computing systems. If given a set of assumptions and a goal, an automated reasoning system should be able to make logical inferences toward that goal automatically. In our context and put simply, it is a system that tries to understand what is happening in our network and infer potential solutions.

FIGURE 2.12 Troubleshooting Pipeline

How you build your own automated reasoning system is well beyond the scope of this book; however, you can partly accomplish this by using a rule-based system.

Imagine the following scenario: You manage a network that commonly suffers from L2 loops. You do not run the Spanning Tree Protocol because of fast convergence requirements, and sometimes your engineering team forgets that and creates looped topologies. In this scenario, you can benefit from having an automated rule-based engine that troubleshoots this issue for you. A subject matter expert (SME) would typically connect to the affected devices, identify interfaces with high utilization and possible packet drops, identify MAC address flaps either using the switch’s log or the MAC address table, and then break the loop. However, if you are not a seasoned SME, you might lose time collecting other show command outputs and at the end create a hypothesis regarding problems other than the L2 loop.

With an “auto-troubleshooting” pipeline, you can abstract what is being collected and analyzed from the devices and output to the network operator only what it thinks the underlying issue is. Of course, it is also possible that this pipeline applies the fix directly, but most of the time in networking use cases, companies want human confirmation.

This works great for common issues such as BPDU Guard–blocked doors and mismatched protocol timers. However, for complex troubleshooting scenarios, you will need a very good rule-based system, which is not easy to create.

Although the previous example was triggered automatically from a monitoring system alarm, you can create “auto-troubleshooting” pipelines for common issues of your network and let operators trigger these manually. This is a good first step, and it also reduces the subject matter knowledge required to troubleshoot common issues. If this does not solve the issue, it can be escalated to a higher-tier, smaller team.

CASE STUDY: INSURANCE TROUBLESHOOTERS

AnyInsurance is a global insurance company operating in four continents. It offers a variety of services, ranging from insurance to asset management. For AnyInsurance, maintaining minimal downtime is critical, not only because of its global time zone coverage but also because of the regulations it is subjugated to.

Although AnyInsurance has a big dependency on IT systems, it is not an IT company. This was reflected in its headcount, with only a few IT engineers maintaining their main two data centers located in Europe and the USA. This strategy was a disaster waiting to happen, and it did. In 2020, AnyInsurance had a data center–wide outage rendering it out of service for almost 24 hours. The data center was eventually recovered, and the root cause was identified to be a device misconfiguration on the border network—the network between the last devices and the Internet.

These kinds of outages happen, but what AnyInsurance identified to be a problem was the time it took to address the issue. Turns out the company’s main engineer, the go-to guy for all things networking, was out the week before and the week of the outage. Because of that, changes were made to the network without his supervision, which would be normal in most companies; however, at AnyInsurance, this guru was one of the only SMEs. Without his knowledge, it took several hours to identify and fix the problem.

You could say that this is a staffing problem and has nothing to do with NetDevOps. However, AnyInsurance contacted a third-party consulting company for help, and the company’s recommendation was to enhance the troubleshooting process and embed it with intelligence.

To accomplish this, the third-party firm worked jointly with the few AnyInsurance SMEs to identify what were common issues in the data centers and what technologies they were using, among other things. In short, they tried to pack some of the SMEs’ knowledge into automated tooling, and they were successful.

They created automated pipelines that run on alarms as well as manually triggerable pipelines that more junior and less-knowledgeable engineers could point at specific devices. These pipelines outputted common show commands recommended by the SMEs as well as recommendations of possible issues that might be occurring based on those same outputs.

The pipelines did not replace the SMEs; however, they allowed AnyInsurance to keep its staffing model of having few SMEs and many operators while minimizing downtime by enhancing the troubleshooting capabilities of the common operator.

Combined

The two previous use cases, migrations and troubleshooting, are combined use cases. They aggregate the previous simpler use cases into more complex end goals.

In networking, complex goals are often what you will encounter. Nonetheless, these complex goals and tasks can be decomposed into smaller, more manageable subgoals and subtasks, and that is what you will learn about in this section.

An interesting combined use case is the use of one pipeline to gather data and store it in a database and the use of other pipelines in the same network for retrieving that data from the database instead of fetching it from the devices directly. A database in this context can be anything, even a local file. This type of interaction between pipelines will reduce resource consumption on end devices because you are not connecting to them for every action. Although all the use cases in this chapter so far have gathered data from the end devices, you can now see this is not necessary.

Another interesting combined use case is network optimization. While your data collection pipelines are collecting data from your network and storing it somewhere (for example, in a database), you can have a pipeline monitoring the stored data, looking for patterns or optimizations. For example, if you are collecting and storing information on the bandwidth utilization of your interfaces, it is possible that your monitoring pipeline will identify underutilized interfaces. In this case, it can trigger a configuration pipeline that alters some traffic-routing metrics to reroute traffic and better utilize your available infrastructure.

There are increasingly more uses for networking data. A practice that is becoming more common is to apply machine learning algorithms to identify patterns in networking data. Just like in the previous scenario, assume that you are collecting and storing your switches’ data—this time percentage CPU utilization. You can build a machine learning model, which some consider to fall within the automation umbrella, and integrate it in a NetDevOps pipeline.

In simple terms, a machine learning model has two phases in its lifecycle: the first phase is where it needs to be trained, and the second phase is where you can use it to make predictions (called “inference”). In the first phase, you “feed” data to the model so it can learn the patterns of your data. In the second phase, you give it a new data point, and the model returns a prediction based on what it learned from past observations.

Continuing our previous example, you can train a model based on percentage CPU utilization per device family. CPU utilization is highly irregular—what is an acceptable value for a specific device doing a specific network function might not be an acceptable value for a different device in the same network. Because of this, it is very complicated to set manual thresholds. Machine learning can help you set adaptable thresholds depending on the specific device based on its historical CPU utilization.

Now what does all of this has to do with NetDevOps? You can have a pipeline that retrains a model when predictions become stale. Likewise, you can call the inference point of the model in one of your alerting pipelines and replace the static alarm thresholds.

Machine learning is starting to have many applications in the networking world—from hardware failure prediction to dynamic thresholds and predictive routing. It is important to understand that if you are using NetDevOps practices, adding machine learning into the mix is simple.

Do you have a use case not covered in this chapter? NetDevOps is only limited by what you can do with automation. So, if you can automate it, you can make it run on a CI/CD pipeline using source control and testing techniques to reap all the benefits you have learned in this chapter.

Decisions and Investments

Let’s say that you love the use cases of NetDevOps because they resonate with your current challenges. So now you ask yourself, “How do I start, and do I buy something?”

In order to adopt NetDevOps, or any other technology, you will have to make several decisions and possibly some investments. This section covers the four main verticals you should consider:

Starting point
New skillsets
New tools
Organizational changes

These might seem like a lot of investments; however, considering the benefits, they are worth it. NetDevOps has some initial investments that decline over time, while its benefits grow over time, as shown in Figure 2-13.

FIGURE 2.13 NetDevOps Investments Versus Benefits Chart

Because this is a fairly new field compared to others in networking, it is hard to find trustworthy resources about it. The four main verticals described in this chapter are derived from the authors’ own experience in the field for the last five years working with NetDevOps. They are not industry standards.

Where to Start

When you start something new, you must begin somewhere. For example, when you are learning a new technology, you can start by reading a book or watching a video online. For some things, where you start does not matter because you’ll ultimately reach the same destination; however, when it comes to adoption of NetDevOps practices, choosing the place to start is very important.

So where should you start? There is no silver bullet or a single place where all organizations should start; rather, each organization should undergo an analysis to decide what is best for its situation. This preliminary analysis should evaluate roughly where the organization is in terms of the following:

Challenges/pain points
Skills
Technology stack

Why these three? There are other verticals you can consider, but evaluating these three typically results in a good starting point. Besides, the state of these three verticals is often well known to organizations, making this initial analysis cheap and fast. You do not need to produce a formal analysis with documentation, although you can do that if you wish. The result of this analysis should be an understanding of where you are in regard to these three topics.

After you have the understanding of where you are, either documented or not, you should add more weight to the first vertical, challenges/pain points. You should start your journey with use cases in mind. Do not try and embark on the NetDevOps journey because of trends or buzzwords. Solving the challenges you have identified that are affecting your organization is the priority.

Prioritize the identified challenges based on their importance for your business but at the same time measure the complexity of each challenge. The result should be an ordered list. This balance between complexity and benefit is sometimes hard to understand, so use your best judgment because this is not an exact science.

So far, you have not factored in the skills and the technology stack verticals from the analysis. This is where they come in. From the ordered list of challenges, add which technologies are involved from the technology stack and what skills would be required to solve them. Some of those skills you might already have, while others you might not. The same goes for the technologies.

Skills come in second in our three verticals. Although the next section focuses solely on skills and how they influence your NetDevOps journey, they also play a role in defining a starting point. Prioritize use cases that you already have the skillsets to implement. Technology comes next, because it is easier to pick up a new technology than a new skillset. However, this does not mean that adopting a new technology is easy, because it is not, and that is why we include it as our third factor.

For the technology stack, there will be many different nuances, and some use cases will not require all NetDevOps components to solve. For example, if your challenge is that people can make modifications to your network device configurations that go unnoticed, creating snowflake networks, and you need a way of maintaining a single source of truth, the only component you need is a version control system repository. Similarly, if your challenge is lack of speed and error-prone copy/paste configuration activities, you might just need to apply automation instead of also using CI/CD pipelines.

Understanding the minimum number of NetDevOps components you will need to adopt makes the journey to success shorter, which leads us to the highest contributing factor to the success rate of NetDevOps adoption: the ability to show successes at the early stages of adoption. Do not underestimate this. It not only motivates the teams involved, but it is a great way to show stakeholders their investments are paying off. Experiencing failures early or going for a long time without anything to show for it is the downfall of many adoption journeys.

However, you cannot really show success if you do not have success criteria. After you decide which use cases you will solve using which NetDevOps components, make sure you define what success looks like. Following up the previous example of snowflake networks and configurations changes that go unnoticed, the success criteria could be to have 80% of devices in the same functions with the same configuration, and to measure this you would audit the network. For the second example, in the slow and error-prone configuration changes environment, you could measure your current estimated time to implement a new configuration and the number of minutes of downtime caused by changes. Then you could define a success criterion of lowering this number by 20%.

Having specific success criteria allows you to show progress and improvement; however, it can also show that you are not actually solving your initial use case. This can be equally beneficial because it enables you to adjust your initial plan. In other words, failing quickly is less expensive.

To summarize, start with understanding where you are right now in terms of challenges, skills, and technology. Make identifying challenges the number-one priority because you want to ensure you are solving something relevant for your organization. Next, prioritize your challenges based on the skills you already have while minimizing the amount of NetDevOps components involved, prioritizing the ones you already have. Before you implement your NetDevOps strategy, make sure you have clearly defined success criteria and plan to show milestone successes early.

Skills

Skills are an influential factor in NetDevOps. In Chapter 1, you learned that many components of NetDevOps are not traditional networking components, and although some folks take them for granted, automation, programming, and orchestration are not an evolution of networking; rather, they are a horizontal domain of knowledge.

Most organizations have network engineers equipped with the traditional skillset of “hardcore” networking. This includes routing protocols, switching configurations, networking security such as access control lists (ACLs) or Control Plane Policing (CoPP), and all the rest. But in the same way that software developers do not know what the Border Gateway Protocol (BGP) is, traditional network engineers do not know what Jenkins or Groovy is.

The profile of a network engineer is evolving, and nowadays we’re starting to see more and more a mix of networking, infrastructure as code (IaC), and orchestration knowledge. However, that might not be the case at your organization. If it is not, there are two schools of thought: upskilling/training and hiring.

You can choose to train your engineers in the skills you have identified to be missing from your “where to start” analysis, or you can hire folks who already have those skills. One option is not better than the other; each organization must make its own decision.

If you choose to train your engineers, you must take into consideration that, as previously mentioned, some of these NetDevOps skills are not a natural evolution of networking and can require considerable effort to learn. For example, software-defined networks (SDN) can be seen as an evolution of networking, and in some way they are the next networking topology. However, writing an automation playbook in a programming language is not an evolution of writing network configurations. Although the line is becoming blurry, and some terms like “network engineer v2” and “next-generation network engineer” have started to emerge, historically speaking, networking and automation have been two different domains.

Not all skills are generic skills such as automation or networking. A skill family that often is overlooked is tool skills. An engineer proficient at automation will not be an expert in all the automation tools; for example, the engineer might have worked extensively with Ansible but never with Terraform. This is particularly important if your chosen strategy is to hire, because most of the time you do not want to upskill a new hire on a new tool if you can hire someone with tool knowledge. In terms of training, this also factors in. Training someone in a tool is easier if that person already has knowledge of the tool domain. In other words, training someone in Golang is easier if they already know how to program in Python.

Another consideration is how you want to distribute your skills in each role; the upcoming section “Organizational Changes” covers how you can distribute your skills: all in one NetDevOps team or separate automation and networking teams. The number and distribution of engineers’ skillsets differ based on each organization’s needs.

Lastly, because you are reading this book, you probably want to become a NetDevOps engineer or transform your organization into one that uses NetDevOps engineering practices; however, not every network engineer will become a NetDevOps engineer. Expert-level networking skills are still required, and many folks may not have to take part in orchestration and automation tasks. This is a common misconception.

Tooling

Tools are an important part of NetDevOps. As you have learned, tools are enablers and not actual DevOps practices; however, some folks still commonly label tools as DevOps. Nonetheless, tools will still represent a big part of your investment. Not only because of their price but also because of tool-specific skills and knowledge. After you and your organization acquire these skills, changes result in added effort and cost.

Within the NetDevOps umbrella, you can separate the tools into the following different categories:

Infrastructure as code
Continuous integration/continuous delivery (or deployment)
Source/version control
Testing
Monitoring

The following list provides examples of tools in each category. Note that this is not an exhaustive list; each category has a plethora of tools to offer.

IaC
- Ansible
- Terraform
- Pulumi
CI/CD
- Jenkins
- GitHub Actions
- AWS CodePipeline
Testing
- EVE-NG
- GNS3
- Cisco Modeling Labs CML
Source Control
- GitHub
- GitLab
- Bitbucket
Monitoring
- Datadog
- ELK stack
- Splunk

Note that the IaC, CI/CD, source control, and testing tools were covered in Chapter 1. Monitoring is a well-known tool vertical in networking that has evolved over time. From the older SNMP pull-based monitoring to the newer push-based telemetry models, common tools to achieve this functionality are proprietary network controls such as Cisco DNA Center, SolarWinds Network Performance Monitor, Splunk, and open source solutions you can tailor to your liking, such as ELK (Elasticsearch, Logstash, Kibana) stack. Monitoring in NetDevOps also encompasses the monitoring of CI/CD pipelines and automation tasks. This is an extended scope compared to only monitoring the network.

Here is a set of characteristics you can use to select the best fit for your organization in any of these tool categories:

Cloud or on-premises
Managed or unmanaged
Open source, proprietary, or in-house
Integration ecosystem

The first characteristic is where the solution will be hosted. All the tools have to exist somewhere, and some have a bigger footprint than others (for example, a CI/CD server with many agents versus an IaC tool that only needs a single server). For the location, you have two choices: the cloud or on-premises. Some folks will argue you have three options, with the third being a co-location facility (for example, a service provider data center). However, this option is encompassed in the on-premises category in this two-location system. The cloud refers to on-demand resources accessed over the Internet. It is a huge trend right now and typically benefits from a pay-as-you-go model. Of the many cloud vendors, the most well known are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). The benefit from using the cloud is that you do not have to manage or secure the physical infrastructure. You can just access your resources as you need them. In contrast, the on-premises option will give you more flexibility when it comes to controlling the storage of your data, and you do not have to rely on Internet connectivity.

The second characteristic is manageability. In the cloud versus on-premises characteristic, you saw that with the cloud, physical management is the responsibility of the provider. Some tools offload more than that; they offload all the management to the provider, other than the actual configuration. For example, you can have access to a working Jenkins instance, and all you need is to create your workflows. You do not need to install Jenkins, configure networking for the instance, or anything like that. For on-premises setups, this is more uncommon, although some service providers manage the installation of tools for you. In general, you should only consider this a feature for cloud-hosted tools. Examples of managed tools are Amazon Managed Grafana, versus hosting your own Grafana in a virtual machine on the cloud; Amazon OpenSearch service, versus hosting your own Elasticsearch cluster in virtual machines; Terraform Cloud, versus hosting your own Terraform environment in containers or virtual machines; and CloudBees-hosted Jenkins, versus hosting your own Jenkins in AWS.

The third characteristic is of special importance because it greatly contributes to the price. All tool categories will have open source solutions; however, these solutions will not have support other than community support. For enterprise environments, this might be a problem. Nonetheless, many organizations offer enterprise-grade support for open source tools. Therefore, don’t rule out open source tools just because they are open source. Ansible, for example, is a free and open source configuration tool, but Red Hat has enterprise support plans for it. An advantage of open source tools is the wealth of knowledge you will be able to find online versus the more exclusive proprietary tools. In contrast to open source tools, proprietary tools are solutions owned by the individual or company who published them. They are “closed” in the sense that changes must be made by the party that published the tool. However, this type of tool usually has several support offerings. This is the most widely adopted tool type in medium- and large-sized companies.

The other option is to build your own tools. Although this option is uncommon, it is not unheard of; some organizations decide to build their own in-house tools using programming languages such as Python and Java. This option requires highly specialized staff, and after the development phase, you will also need to provide support. The advantage is that the tool will have only the functionality you need and not numerous features you never use, as commonly happens with commercial off-the-shelf (COTS) tools. On top of that, if the tool needs modifications (for example, you find a buggy behavior), you are the vendor, so you can immediately apply a fix. If you are starting out in automation and orchestration, however, this option is not recommended for you.

The final characteristic is the tool ecosystem. This characteristic is sometimes undervalued, but having a tool integrate natively with the rest of your tooling can be a very big advantage (versus having to script a lot of functionality). For example, most source control tool vendors now offer CI/CD tools embedded together. This greatly simplifies integrations, and you can run your workflows directly from the source control repository. Examples of this are Gitlab CI/CD and GitHub Actions. You will also see very tight and native integrations when you are using all the tooling from the same cloud vendor, such as Azure DevOps ecosystem and Amazon Code* services.

Ensure you are prioritizing skills over tools. A characteristic that did not make it on the list is the available skills in the market. Your prioritization should start with the tool skills you have available; however, you should consider how widely adopted within the industry a tool is because this will greatly improve your ability to hire folks who are proficient with the tool or to find training materials for your own folks.

Now that you know what characteristics to look for in a tool, you need to adapt this to your organization’s needs and liking. Be aware of the tradeoffs; for example, continuing to use Ansible for provisioning while understanding Terraform might be a better solution for your use case if you already have Ansible skills and existing support from Red Hat. Again, there is no single “works every time” choice, but aligning your tool choice to your organization’s strategy helps. For example, if your organization is implementing a cloud-first strategy, cloud-hosted tools are preferred. Likewise, if your organization is not an IT-focused organization and instead focuses on a different core business, offloading the management of the tools (sometimes called “undifferentiated heavy lifting”) to focus your resources on differentiation activities is likely the right choice.

Finally, because so many tools are available right now, you need to be careful of tool sprawl, meaning having too many tools and even unused tools. It is okay to use specialized tools that are very good at a specific task or action, but it is important not to let the tools take over your organization. Likewise, retire old tools if they are no longer being used. As you have learned, NetDevOps is a set of practices, not tools, but using the right tools for the job is one of those practices.

Organizational Changes

Network operators and network architects—that is, folks who design and configure networks—are typically already on the same team or work closely aligned. In the development world, however, developers and operations were many times on completely separate teams. This is good because adopting NetDevOps has less of an impact on an organization’s structure as DevOps did back in the day in software development.

However, being on the same team does not mean you do not need any organizational changes at all. For some more traditional teams, adopting these practices might require hiring new folks, as described previously in this chapter, which is an organizational change already.

You can tackle NetDevOps in one of two ways: join automation and networking together into one team or have automation and networking in separate teams but collaborating.

If you choose to separate automation and networking in different teams, which is not recommended, your organization needs to find a way of bridging these two areas. For example, your networking folks create configuration templates per device type and platform, while your automation team creates the automation scripts and orchestration pipelines to deliver the configurations. If you chose this approach, remember that each of these teams has no idea about the others’ domain expertise and challenges, so communication and collaboration are key. Working in isolation will greatly impair their ability to deliver fast and working solutions.

If you choose to relabel your networking team as a NetDevOps team, your folks will be doing end-to-end tasks. This is the recommended approach. With this method, everyone has an understanding of the use cases and challenges, which is beneficial in the successful adoption of these practices. This is the most ambitious approach, and you may face more resistance from folks who prefer the traditional approaches to networking.

Although this might seem like a simple rebranding exercise, it is not, and it is important to have support from the organization and key management stakeholders. This support is paramount to address resistance from less-inclined engineers as well as to address and justify a potential initial loss of productivity or higher costs due to ramp-up.

Another important organizational change is to adopt open communication and encourage failing fast. When new processes, tools, skills, and technologies are being adopted, mistakes will happen and questions will arise. It is important to foster a culture of collaboration and open communication, where engineers are encouraged to present their questions and doubts while experimenting and, in some cases, failing. This is a NetDevOps principle. Although this might not seem like an organizational change, many organizations do not openly embrace open communication and failing, and they are surprised by this aspect when adopting NetDevOps.

The junction of the networking and automation domains, which likely have been working separately until now, should be reflected in your organizational structure. However, independent of your decisions regarding skills, tooling, and where to start, you should understand that it can take time to successfully reshape your organization into this new paradigm.

Adoption Challenges

Adopting new practices can be challenging, and although you now know that NetDevOps is a mix of already well-known and battle-tested practices, it is still likely that you will face challenges in your journey.

This section describes common challenges and recommended mitigations associated with the adoption of NetDevOps in organizations of all sizes.

Do not get discouraged if you face one of the following challenges. Adoption of new technologies often comes with initial challenges, but the benefits far outweigh the initial burden.

Remember the RIP routing protocol? Frame relay? Half-duplex Ethernet? When introduced, they were different, disruptive, and folks had to learn them. This initial hardship was, however, worth it. A lot of current networking technologies have evolved from these.

Traditional Mindset

The networking field is commonly associated with traditional or old technologies. Although this is not necessarily a true assumption, some networking practices are indeed rudimentary and old, such as physically connecting to devices via a cable and typing commands one by one on a command line. This is not to say this practice is wrong or that there was a better way of performing these tasks before—if a device was isolated on a network, there were not many options other than physically plugging in a laptop to it. Nowadays, there are more options, such as zero-touch provisioning (ZTP) for Day 0 configuration and the use of automation tools for Day 1 and Day 2.

One challenge you will encounter is dealing with organizations and folks who are attached to the old ways of performing tasks. Before, their way may have been the best way, as just described; however, now, there are likely better ways. Some folks will resist change and refuse to adopt new practices.

A common complaint is, “We’ve been doing it this way for X years.” It is not easy to convince these folks of a better way of doing things; however, here’s one way that typically works: instead of leading with the possible benefits of the solutions you are trying to adopt, perform a proof of concept (PoC) with other collaborating individuals and return to the skeptics with factual results, such as improved time to execute change requests or less downtime. It is harder to ignore results than a business pitch. Likewise, the competition factor of seeing others succeed with different techniques will often increase the likelihood of folks wanting to join in. In other words, they do not want to be left out.

Another important aspect, as mentioned previously, is senior stakeholder support. A clear request coming from senior leadership is harder to ignore than a colleague’s request. For any organizational transformation, leadership support is vital, and NetDevOps is no different. Try to find this support early in your adoption journey, preferably right from the start.

Testing or Lack Thereof

Network testing has always been associated with acquiring additional expensive hardware and putting in extra work. Because of these reasons, and others, testing network changes is often kept to a minimum by most organizations, and many times happens in a production environment. When was the last time you copy/pasted commands into a production network device within a maintenance window without previous testing? What about the last time you tested a network configuration change in a test environment that mirrors your production environment? If you only remember the first scenario, you are not alone.

In the software development world, testing is part of the culture. Writing tests for your code and executing them is a standard practice across the industry. There are even software development processes such as test-driven development (TDD) where tests are written before the actual software implementation of a feature. DevOps embraces testing as a way that enables safe, continuous integration and aims to make sure nothing preexisting is broken with new modifications. Likewise, as you have learned in this book, NetDevOps also makes extensive use of testing for networking. Networks are critical, and network changes should be executed with confidence.

You will encounter two common arguments against the adoption of network testing. The first one is, “This was working fine until now without a test environment.” The second common argument is, “A test environment is too expensive for the benefits it provides.”

To answer to the first argument, you must show how new practices such as automated changes increase the number of changes and features that can be implemented in a shorter span of time, and you must show that testing greatly increases the chances of success without rollbacks. A single maintenance window per year, where all the change requests are executed, is no longer enough to support modern applications’ changing requirements. With multiple changes being executed per month, or even per week in some cases, your organization can benefit from a test environment, which increases confidence in the success of these network changes.

For the second argument, yes, historically test environments were very expensive and required physical hardware. However, this is no longer the case. Although you still can acquire physical hardware and build a test environment, you are not required to because there are plenty of virtualized options.

In Chapter 5, you will learn how to install and configure EVE-NG to virtualize network topologies that can be used as a testbed. Although this is a commonly used tool for network testing, you will also learn about different options you can choose from.

Physical testbeds are still irreplaceable for some products and specific features (for example, when you are trying to load-test a specific hardware model). Nonetheless, network testing now has a lower barrier of entry, and many if not most types of functionality can be tested on virtualized network devices. If you manage a critical network, where you need the maximum amount of confidence in your changes, a physical setup that mirrors your production environment is likely still the preferred choice.

Success Criteria or Lack Thereof

In the “Decisions and Investments” section of this chapter, you learned the importance of having clearly defined success criteria. This is the biggest adoption challenge because it directly influences your ability to show successes. This challenge can manifest itself in one of two ways: lack of defined success criteria or unrealistic success criteria.

The first way is the most common. Folks embark on a NetDevOps journey without a measurable destination. They want to reap the benefits of automation and orchestration, so they set out on their way and make some initial investments; however, they end up quitting before they reach the point of seeing positive returns. Figure 2-13, earlier in this chapter, plots the relationship between investment and benefits. Other folks actually achieve successes, but without a goal or criteria to measure them against, so they end up being shut down by management, who do not understand what was actually achieved.

The second way this challenge manifests is when folks define success criteria that either are too ambitious or are cheatable. If you set success criteria that people can obtain by cheating, you may perceive you are getting benefits when in reality you are not. This contributes to resource waste and bad decisions.

An example of a cheatable criterion is one that measures success by achieving ten network changes a month. However, folks can break a single change into smaller changes, meaning one traditional network change can transform into the needed ten. Therefore, consider how “cheatable” your defined criteria are.

When you set success criteria too high, you might never achieve them. This is acceptable if you understand the context of these criteria or the way the criteria were set up. However, oftentimes senior stakeholders are not aware of either, and they simply look at these criteria as yes/no boxes. If you fail to meet the success criteria, your NetDevOps initiative might be shut down.

Defining success criteria might seem like wasted effort at first, and you might face resistance when you propose defining them. Therefore, explain the “why” behind this choice and how it will contribute later in the adoption journey when everyone understands the progress made.

New Skillset

You learned how NetDevOps requires not only networking skills but also automation and orchestration skills, which are not a natural evolution of networking. Although most folks who embark on the NetDevOps journey are completely aware of this fact, skills are usually still a big challenge in adoption.

You can face two types of challenges: folks not wanting to be upskilled and the organization not wanting to invest in the adoption of this new skillset, either through upskilling or hiring.

For the first challenge, there is not much you can do other than to adopt a hiring tactic instead of upskilling. You might find individuals who simply do not want to learn these new verticals and stick to traditional networking, and that is completely fine. You should not force them; instead, apply a different tactic.

For the second challenge, you can invest in explaining to the organization how these skills are different from the former networking landscape and what benefits they will bring. Many times, this challenge comes from the wrong understanding that NetDevOps is an evolution of networking and therefore the same skillsets apply.

CASE STUDY: SKILLS EVOLUTION AT A BANK

YetAnotherBank is a large bank with a presence in multiple countries across multiple continents. Because of its global presence, YetAnotherBank has a very large install base of multiple types of equipment from various vendors.

Banks, because they are part of the payment industry, often have a very high aversion to risk. This also applies to networking changes, which are typically kept to a minimum. Often, these limitations have led banks to have outdated software releases running on their hardware or suboptimal configurations.

YetAnotherBank is a good example of a traditional organization. It only allowed network changes once every quarter, unless they were very critical or were fixes in the case of an outage. Moreover, any change required a chain of approvals with well-documented procedures and justifications.

This change process was put in place because, in the past, YetAnotherBank had production-level outages related to manual changes that were improperly applied.

Some stakeholders of YetAnotherBank wanted to adopt NetDevOps practices for network changes on their managed data center in order to decrease the risk associated with operator manual network changes and increase the number of changes from once a quarter to at least once a month. Naturally, they faced challenges. The most interesting challenge was the lack of support to evolve the networking team’s skills.

Senior management for YetAnotherBank did not understand why evolving the network practices would require so much investment in retraining their current teams and/or hiring externally. Even when the team explained the benefits that NetDevOps adoption would bring, especially the reduction of risk for network configuration changes, the senior stakeholders were not supportive.

The data center team decided to procure automation and orchestration talent internally—inside the company but outside of the networking team—in order to produce a small-scope PoC to showcase the benefits instead of describing them.

Luckily, YetAnotherBank is a very large organization, so it had multiple software development teams, and it also had an internal program that allowed employees to devote a percentage of their time to activities outside their job responsibilities.

The team found four developers who agreed to help with the PoC. In one month they created an automated pipeline that delivered automated configuration changes to a specific device type; furthermore, the pipeline used network device virtualization to test the changes before applying them to the target device. The team was very happy with the results and started using this pipeline on their quarterly maintenance window.

More than using the pipeline, the team gathered metrics on the executions, such as average time to deploy a change, number of rollbacks, number of outages, and so on. With this data, they went back to the senior management team and showed them the data on these improvements.

The difference that these four developers made was very clear, and it was enough to convince the senior management team to fund a NetDevOps practice team within their geography.

Some months later, YetAnotherBank took its NetDevOps practice team further and made it a global team that enables local teams to improve their traditional networking processes.

Adding to this challenge is the lack of the needed skillsets in the market. NetDevOps is a mix of different domains and is still a relatively new trend. This, combined with a very competitive labor market, makes finding the right skillset typically very challenging. This also applies to the retention of talent when an upskilling tactic is adopted. Consider this when you find the right candidate or when you are trying to retain that special NetDevOps engineer.

Summary

In this chapter, you learned several use cases where NetDevOps improves the current state of the art of networking operations:

Provisioning
Configuration
Data collection
Compliance
Monitoring and alerting
Reporting
Migrations
Troubleshooting
Combined

In Chapter 4, you will see code implementations of these use cases, together with real-life examples.

NetDevOps adoption is a journey. In this chapter, you learned how you can start that journey, by prioritizing starting points that usually work well, such as focusing on solving your use cases instead of following market trends as well as prioritizing skills over technologies. You also learned the different characteristics tools can have and how those characteristics can impact your tool choice.

This chapter finished with the common pitfalls and challenges organizations and teams suffer from during their NetDevOps adoption journey, along with recommendations on how to mitigate or circumvent them.

Now that you know the theory of NetDevOps, it is time to dive deeper into the specific components. In Chapter 3, you will dive deep into the orchestration component of NetDevOps and learn how Jenkins implements CI/CD logic.

Review Questions

You can find answers to these questions in Appendix A, “Answers to Review Questions.”

What is the predominant configuration tool in networking use cases?
1. Terraform
2. Ansible
3. Python
4. Chef
What stage is common to all NetDevOps pipelines?
1. Code
2. Security verifications/linting
3. Deploy to Dev
4. Deploy to Prod
What is the most common data collection pipeline trigger?
1. Manually executed by an operator
2. Automatic by another pipeline
3. Automatic by a change in the source control
4. Automatic on a schedule
How many actions can you trigger using alarms?
1. One
2. Ten
3. Fifty
4. Unlimited
In NetDevOps pipelines, when an action can fail but is idempotent, which characteristic should you configure your stage with?
1. Rollback
2. Retry
3. End
4. Linting
What term is used in machine learning to describe the stage when a model is producing predictions?
1. Inference stage
2. Prediction stage
3. Training stage
4. Production stage
What is typically the biggest investment when adopting NetDevOps practices?
1. Purchasing DevOps tools
2. Acquiring new skills
3. Purchasing networking equipment
4. Hiring external consultancy
What should you prioritize first when starting your NetDevOps adoption journey?
1. Existing skillsets
2. The same technology stacks
3. Solving current challenges
4. Defining success criteria
Which of the following is not a type of NetDevOps tooling?
1. CI/CD orchestration
2. Infrastructure as code
3. Monitoring
4. Cloud
Do virtual network devices support all features needed for testing?
1. Yes
2. No