Security Predictions 2016: A Data Analysis Approach

Author: Daniel Schatz, CISM, CCSK, CISSP, CSyP, CVSE, ISO 27001 LA/LI, MCITP-EA
Date Published: 4 August 2016

The topic of information security has evolved into one of the top concerns among policymakers and corporations. Leaders demand answers from their support structures as to how such risk can be effectively managed.1 In contrast to, for example, the physical security space, answers on risk, impact and cost are not as straightforward.

This is due to the rapid developments in this space that leave subject matter experts with very limited historical data to support reliable risk models and even fewer options to anticipate future developments. In the absence of such data, information security professionals rely on subject matter knowledge (expert judgement), assumptions, vendor recommendations and industry best practices such as the US National Institute of Standards and Technology (NIST) Framework for Improving Critical Infrastructure Cybersecurity2 to try to manage information security risk. Nevertheless, managing risk from a position of uncertainty is uncomfortable; inputs that aid in reducing ambiguity or even predict future developments can be of value.

This is not unique to information security, of course. An increasing body of research across multiple domains is developing around the advantages and disadvantages of predictions and forecasts.3, 4, 5, 6 However, forecasting and predictions are viewed rather critically by media7, 8 and subject matter experts (i.e., information security professionals) alike. Regardless of this scepticism, a constant stream of predictions is published, mainly by vendors, every year. This article asserts that these predictions should be considered useful information rather than marketing noise if they are read with critical thought and a bias-conscious mind-set.

This article looks at published security predictions for the year 2016 collected from public sources over the period of October 2015 to January 2016. This is a high-level overview of underlying themes based on a manual categorisation approach. Analysis of the prediction pool utilising co-occurrence networks and topic modelling with Latent Dirichlet Allocation (LDA)9 provides a second, unbiased view of the underlying themes.

Data Collection

The collection of security predictions was conducted through simple search alerts, manual review of press releases, vendor notifications and revisiting sources known from previous years. Only those predictions that are considered relevant have been included. Relevancy was defined as ‘informed opinions or assumptions on developments in the information security threat landscape throughout 2016 expressed as forecast or prediction’. Data collection was done on a best-effort basis and the coverage here is not claimed to be complete. However, the prediction data set covers 238 individual predictions from 41 sources.

Security Predictions 2016

The first step of the analysis distinguished between predictions that discuss an expected change in the security threat landscape (e.g., ‘Increased targeting of Apple devices by cybercriminals’) and those that provide general opinions on developments in the information security industry (e.g., ‘More CISOs will be hired’). For the scope of this research, the focus is on ‘true’ security threat predictions rather than general developments. This reduced the data set to 187 predictions.

The next step was to categorise each prediction to align with one of 15 threat categories. These 15 high-level categories included threats associated with areas such as cloud computing, insider threats, malware and state-sponsored attacks (figure 1), and were originally defined in 2013. The categories are based on previous work by members of the Financial Services Information Sharing and Analysis Center (FS-ISAC) community and supplemented based on threat developments in 2013. As with any categorisation attempt, the problem of defining too few or too many categories is a valid topic of discussion. An alternative view on this is addressed in a later section of this article.

An overview of the categories and their popularity in terms of 2016 predictions is shown in figure 1. This figure provides a high-level overview of the topical areas with noticeable developments expected in the 2016 security threat landscape according to these sources.

These data show the direction from which threats may develop in the coming months, but it is important to consider the source of the predictions. Good results can be achieved by combining forecasts from eight to 12 experts whose knowledge of the problem is diverse and whose biases are likely to differ.10 With this in mind, it is crucial to investigate whether prediction areas are supported by multiple predictions made by only one vendor or if, indeed, there is a wider consensus.

Figure 2 shows a detailed breakdown of the prediction distribution by vendor. This view allows further critical analysis of threat category predictions based on the source of predictions. The illustration provides a complete overview of the prediction sources (i.e., the vendors) and the number of predictions each vendor contributes to each category (block size). For example, IBM contributes several predictions on denial of service (DoS) but nothing on any other category, whereas AT&T covers a variety of categories with its predictions for 2016. It becomes immediately apparent that threat developments in the category Internet of Things (IoT) are a widespread concern across most sources. Conversely, only one source (IBM) is driving concerns in the DoS category. However, in general, the predictions for 2016 appear to be a balanced distribution across categories and sources.

Based on the analysis so far, it is simple to deduce that the sources indicate noticeable developments, particularly in the areas of IoT, organised crime attacks and malware, but how does this compare to previous years? Figure 3 compares the predictions from 2014, 2015 and 2016 in an attempt to understand which areas are more of the same and which areas are new developments.



This simple visualisation gives an indication of where new threat developments are expected. According to the sources, relatively stable threat development (as measured by average prediction count) in the areas of DoS, insider, malware, organised crime attacks, social engineering and human aspects, social media, state-sponsored attacks, supply chain issues, and general vulnerability management should be expected. Surprisingly, a drop in predictions for all things related to mobile workforce/malware is noted. In contrast, hacktivism, IoT and regulatory changes are obviously strong emerging areas of concern in the coming months.

While most of this will make intuitive sense to information security professionals, a few points that may impact the validity of these findings should be highlighted. Potential shortcomings in the data collection process have been mentioned. Obviously, inclusion of additional predictions may result in a different predicted threat landscape. However, the data set is believed to be sufficiently representative and balanced so as to provide a useful overview.

An additional issue lies with the sources themselves. Few, if any, of the predictions are made following rigorous processes,11 so they are likely to be vulnerable to bandwagon or current-events biases.

There is also the previously mentioned issue of categorisation. The 15 categories used over the last three years of data collection may not be the best fit today. If this process started over today, it might include a slightly different category selection. However, for consistency, the established categories are used here. Aligning individual predictions with a predefined category is often not straightforward because sources may cover various aspects in one distinct prediction. Consequently, this process is forced to apply a subjective ‘best fit’ approach. Recognising these challenges, the decision was made to investigate an alternative view on the data set that is largely unbiased, but requires more effort to interpret.

2016 Predictions by the Numbers

It is important to have a basic understanding of the key terms in the prediction data set. The software environment for statistical computing ‘R’ provides a useful platform for this because the data set can be imported and the text mining package tm (with its tm_map function) can be used.12 The usual corpus preparation tasks (remove punctuation, strip white space, convert to lowercase and remove stop words), except stemming,13 were applied and a (sparse) document term matrix (DTM) was created. The DTM is a way of structuring documents and the terms used within the documents.
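The same preparation pipeline can be sketched outside R as well. The following minimal Python sketch uses only the standard library; the three-document mini-corpus and the short stop-word list are purely illustrative stand-ins for the 187 predictions and tm's full stop-word list:

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the 187 prediction texts.
predictions = [
    "Increased targeting of Apple devices by cybercriminals.",
    "Ransomware will target connected IoT devices.",
    "Attackers will use ransomware against health care data.",
]

# A small illustrative stop-word list (tm ships a much larger one).
STOP_WORDS = {"of", "by", "will", "use", "the", "against"}

def prepare(text):
    # Lowercase, strip punctuation, drop stop words; no stemming,
    # mirroring the corpus preparation described in the article.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Document term matrix: one term-frequency counter per prediction.
dtm = [Counter(prepare(p)) for p in predictions]

# Vocabulary across the whole corpus.
vocab = sorted(set().union(*dtm))
```

In the real analysis the matrix is sparse, so rarely used terms are pruned before any correlations are computed.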

In essence, the matrix shows frequencies of terms across all documents in scope. In this case, the individual predictions represent the documents; because terms are not used in all predictions, the matrix was reduced by removing those terms that were rarely used. This enabled the calculation of a correlation matrix for the key terms across the prediction data set (figure 4). It can be seen that some relations are forming, for example, Internet/devices, business/information/data/target, and attackers/malware.
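A term correlation of this kind is simply the Pearson correlation of two terms' frequency vectors taken across all documents. A minimal sketch, with toy counts that are illustrative rather than taken from the actual data set:

```python
import math

# Toy document term counts: one dict per prediction (illustrative).
docs = [
    {"internet": 1, "devices": 2, "malware": 0, "attackers": 0},
    {"internet": 1, "devices": 1, "malware": 0, "attackers": 0},
    {"internet": 0, "devices": 0, "malware": 2, "attackers": 1},
    {"internet": 0, "devices": 0, "malware": 1, "attackers": 1},
]

def correlation(term_a, term_b):
    # Pearson correlation of two terms' frequency vectors
    # across all documents in the corpus.
    xs = [d.get(term_a, 0) for d in docs]
    ys = [d.get(term_b, 0) for d in docs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

Terms that tend to appear in the same predictions (such as ‘internet’ and ‘devices’ above) correlate positively; terms from disjoint predictions correlate negatively.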

While this provides some basic insights, it is a rather limited overview that does not lend itself to drawing deep conclusions on the underlying context. To get a more meaningful view of the contextual relationships inherent to the data set, co-occurrence network analysis14 was used. In textual analysis, co-occurrence networks show words with similar appearance patterns and, as such, with high degrees of co-occurrence. The approach is based on the idea that a word’s meaning is related to the concepts to which it is connected. It also has the benefit that there is no coder bias introduced other than to determine which words are examined.15

However, network graphs can get too crowded unless sensible restrictions are applied. By filtering out terms with frequencies of <15 when producing the co-occurrence network graph, the information presented was reduced while preserving important context. Figure 5 presents an at-a-glance view of the important underlying concepts inherent to the words used in the prediction set.
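Building the co-occurrence counts behind such a network graph is straightforward: count, for every pair of surviving terms, the number of documents in which both appear. A minimal sketch with illustrative tokenised documents and a threshold of 2 instead of the 15 used on the real data set:

```python
from collections import Counter
from itertools import combinations

# Tokenised predictions (illustrative; the real set has 187 documents).
docs = [
    ["iot", "devices", "attacks"],
    ["iot", "devices", "ransomware"],
    ["ransomware", "attacks", "data"],
    ["iot", "attacks"],
]

MIN_FREQ = 2  # the article filters out terms with frequency < 15

term_freq = Counter(t for doc in docs for t in doc)
kept = {t for t, c in term_freq.items() if c >= MIN_FREQ}

# Edge weights: number of documents in which two kept terms co-occur.
cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc) & kept), 2):
        cooc[(a, b)] += 1
```

The resulting weighted edges are what the network graph visualises; rare terms such as ‘data’ above never make it into the graph.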

As mentioned previously, there can be a large number of tenuous relationships between terms. To enhance the readability of the network graph, showing only the minimum spanning tree (MST) is a way of indicating which edges are important and focusing on those. In addition to reducing the display to the MST, community detection was added to further emphasise connected components. Community detection is typically used to identify highly connected groups of vertices in complex networks. This kind of structure reveals information about the network, such as underlying topics, which is exactly what is sought in this case.
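The MST reduction itself can be sketched with Kruskal's algorithm. Because a high co-occurrence count means a strong association, edges are sorted by descending weight, which yields a maximum spanning tree (equivalently, an MST over the distance 1/weight). The nodes and weights below are illustrative, not the article's actual counts:

```python
def spanning_tree(nodes, weighted_edges):
    # Kruskal's algorithm: greedily keep the strongest edge that
    # does not close a cycle, tracked via union-find.
    parent = {n: n for n in nodes}

    def find(n):  # union-find with path compression
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for (a, b), w in sorted(weighted_edges.items(), key=lambda e: -e[1]):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b, w))
    return tree

# Illustrative co-occurrence weights between terms.
edges = {("iot", "devices"): 9, ("devices", "internet"): 7,
         ("iot", "internet"): 3, ("malware", "attackers"): 5,
         ("internet", "malware"): 2}
mst = spanning_tree({"iot", "devices", "internet", "malware",
                     "attackers"}, edges)
```

A spanning tree over n terms keeps only n-1 edges, so the weak ‘iot’/‘internet’ edge is dropped in favour of the stronger paths that already connect those nodes.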

The node size illustrates the term frequency and detected communities are highlighted in different colours. Based on the data set, it appears that the ‘random walk’ or ‘walktrap’ algorithm16 provided the best (subjective) community detection approach. Combined with MST, it aids in understanding not only the key concepts, but also how words are grouped into communities and which communities are close to each other.

Looking at the communities (colours) in figure 5 reveals surprisingly coherent topics. Some of them are not far off from the manual approach such as the IoT (purple), ransomware (pink) or general organised crime (green) topics. But there are additional topics of interest that were not quite as obvious previously; the prediction sources highlight areas of concern with health care incidents and industry insurance policies (red), social media (dark purple), transport layer encryption (orange) and malicious vendor code (yellow).

At this point, there is a better understanding of predicted developments in the 2016 threat landscape, but, ideally, an unbiased identification of all underlying topics inherent to the data set should be conducted. One of the possible ways to do this is the use of topic models. Topic models are basically algorithms helping to discover latent themes that underlie a large and otherwise unstructured collection of sources.17 What this means in this case is that one of these algorithms can be leveraged to aid in bringing the latent topics in the prediction data set to the surface.

This research used LDA18 to find the ‘latent’ prediction topics. LDA is a fairly complex concept, but the intuition behind it is quickly explained. For a given collection of documents, one assumes a range of underlying topics described by the terms used within the documents. Each topic is treated as a probability distribution over terms, so one can view a document as a probabilistic mixture of these topics. LDA applies a statistical model to assign terms to those topics. However, the number of topics to look for is not known in advance. To determine the optimal number of topics for this data set, the harmonic mean approach19 was applied, which estimates the marginal likelihood of the data for each candidate number of topics from sampled model likelihoods. For this data set, 17 topics were found to be the optimum; the manual categorisation approach with 15 topics was seemingly not too far off. Figure 6 shows all 17 topics with the first seven words associated with each topic. It lists each identified topic (1-17) with the associated terms in the same row, in declining relevance from left to right. For example, the first row represents an identified topic for which the terms ‘devices’, ‘IoT’, ‘Internet’, ‘connected’ and ‘security’ have high relevance. As this is an automated approach, not every topic makes immediate sense, but, by and large, the ‘headlines’ paint a surprisingly clear picture. Again, the same threats are seen, including IoT, cyberbreach insurance risk, hackers targeting social media/news, transport layer encryption issues, vulnerabilities in mobile apps and ransomware.
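The harmonic mean step can be sketched as follows: for each candidate topic count, per-sample model log-likelihoods (hypothetical below; in practice obtained by Gibbs sampling the fitted LDA model) are combined into a harmonic-mean estimate of the marginal likelihood, computed stably in log space, and the best-scoring candidate is selected:

```python
import math

def log_harmonic_mean(log_likelihoods):
    # Harmonic-mean estimate of the marginal log-likelihood from
    # per-sample log-likelihoods, computed stably in log space:
    # log S - logsumexp(-ll).
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    lse = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(log_likelihoods)) - lse

# Hypothetical per-sample log-likelihoods for three candidate
# topic counts (real values would come from sampling the model).
candidates = {
    15: [-1050.0, -1052.0, -1049.5],
    17: [-1040.0, -1041.5, -1039.0],
    19: [-1045.0, -1047.0, -1044.0],
}

scores = {k: log_harmonic_mean(lls) for k, lls in candidates.items()}
best_k = max(scores, key=scores.get)  # the topic count to fit
```

With these toy numbers the 17-topic model scores highest, mirroring the outcome reported for the actual data set.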

To visualise the newly identified topics, a tool called LDAvis20 was used. The tool creates interactive web-based visualisations of topic models that have been fit to a corpus of text. Figure 7 illustrates the output based on the IoT topic (figure 6, row 1); the distance map on the left shows that the topic is strong, with limited overlap of term association with other topics, i.e., the topic is described by a fairly unique set of terms compared to other topics.



This is further highlighted by the frequency of terms within this particular topic (purple) and across all topics (blue), as shown to the right of the chart. Both ‘devices’ and ‘IoT’ are almost exclusive to this one topic. As an added benefit, an easy view of other important terms within this topic is provided, allowing for a deeper understanding of the prediction. Taking note of the terms ‘smart’, ‘consumer’, ‘health care’ and ‘medical’, two specific areas in which the sources predict trouble become clear. However, further analysis of each topic is out of scope for this article.

Conclusions

The information security threat landscape is a fast-developing space and poses considerable challenges for security programmes in most organisations. Security professionals have limited tools to try to anticipate potentially critical changes in the threat landscape that may impact their strategy; yet predictions on these developments are often dismissed out of hand. This article argues that security professionals can and should make use of these data points (wisdom of the crowd) to refine their protection strategies, as long as they are applied with critical thinking and a bias-conscious state of mind.

Possible approaches on how to conduct such an exercise were given, using an extensive collection of security predictions for the year 2016 as an example. This article illustrated how a simple manual categorisation leads to quick and useful results. It further showed an automated, unbiased, text analysis and topic modelling approach that enables professionals to gain a deeper understanding. It is hoped that both the results and the methodology serve colleagues in this space as inspiration to improve their information security risk management practice.

Endnotes

1 Clinton, L.; Cyber-Risk Oversight Handbook, National Association of Corporate Directors, 10 June 2014, https://www.nacdonline.org/Resources/Article.cfm?ItemNumber=10688
2 National Institute of Standards and Technology, Framework for Improving Critical Infrastructure Cybersecurity, USA, 12 February 2014, www.nist.gov/cyberframework/upload/cybersecurity-framework-021214-final.pdf
3 Armstrong, J. S.; ‘The Seer-Sucker Theory: The Value of Experts in Forecasting’, Technology Review, 1 June 1980, p. 16-24
4 Armstrong, J. S.; K. C. Green; A. Graefe; ‘Golden Rule of Forecasting: Be Conservative’, Journal of Business Research, vol. 68, iss. 8, August 2015, p. 1717-1731
5 Leoni, P.; ‘Market Power, Survival and Accuracy of Predictions in Financial Markets’, Economic Theory, vol. 34, no. 1, January 2008, p. 189-206
6 Denrell, J.; C. Fang; ‘Predicting the Next Big Thing: Success as a Signal of Poor Judgment’, Management Science, vol. 56, no. 10, 7 June 2010, p. 1653-1667
7 Dubner, S. J.; ‘The Folly of Predictions’, Freakonomics Radio podcast, 14 September 2011, http://freakonomics.com/podcast/new-freakonomics-radio-podcast-the-folly-of-prediction/
8 Harford, T.; ‘Why Predictions Are a Lot Like Pringles’, The Undercover Economist blog, 12 January 2016, http://timharford.com/2016/01/why-predictions-are-a-lot-like-pringles/
9 Blei, D. M.; A. Y. Ng; M. I. Jordan; ‘Latent Dirichlet Allocation’, Journal of Machine Learning Research, vol. 3, 2003, p. 993-1022
10 Op cit, Armstrong et al.
11 Ibid.
12 Meyer, D.; K. Hornik; I. Feinerer; ‘Text Mining Infrastructure in R’, Journal of Statistical Software, vol. 25, iss. 5, 31 August 2008, p. 1-54
13 Porter, M. F.; An Algorithm for Suffix Stripping, Morgan Kaufmann Publishers Inc., USA, 1997
14 Rice, R. E.; J. A. Danowski; ‘Is It Really Just Like a Fancy Answering Machine? Comparing Semantic Networks of Different Types of Voice Mail Users’, International Journal of Business Communication, vol. 30, iss. 4, September 1993, p. 369-397, http://job.sagepub.com/content/30/4/369
15 Ryan, G. W.; H. R. Bernard; ‘Techniques to Identify Themes’, Field Methods, vol. 15, iss. 1, 2003, p. 85-109
16 Pons, P.; M. Latapy; ‘Computing Communities in Large Networks Using Random Walks’, Computer and Information Sciences—ISCIS 2005 Lecture Notes in Computer Science, Springer Berlin Heidelberg, Germany, 2005, p. 284-293
17 Blei, D. M.; ‘Probabilistic Topic Models’, Communications of the ACM, vol. 55, no. 4, April 2012, p. 77-84
18 Op cit, Blei et al.
19 Griffiths, T. L.; M. Steyvers; ‘Finding Scientific Topics’, Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, 6 April 2004, p. 5228-5235
20 Sievert, C.; K. E. Shirley; ‘LDAvis: A Method for Visualizing and Interpreting Topics’, Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, USA, 27 June 2014, p. 63-70

Daniel Schatz, CISM, CCSK, CISSP, CSyP, CVSE, ISO 27001 LA/LI, MCITP-EA
Is the director of threat and vulnerability management for Thomson Reuters, working in London, UK. He is a member of the International Systems Security Association (ISSA-UK). He was an organiser of the popular BSides London security conference before stepping down in 2013 to focus on his doctoral studies at the University of East London.