BSc/MSc projects - David Clark

All of these project suggestions have connections to Software Engineering, Program Language Theory, and Language Based Security. The list is not complete but is offered to give the flavour and breadth of my current interests.

1. software testing and information theory

Measure incidence of Failed Error Propagation for real faults and seeded faults, Compress programs and look for correlations with how difficult they are to test. Improve software oracles. use information to improve diversity of inputs, outputs and everything in between.

Investigate analysing Failed Error Propagation in security testing

2. detecting malware using a language model

A language model gives the conditional probability of a string symbol occuring given a preceding string. If the preceding string is n symbols long it is called an n-gram. Linguistics and Natural Language Processing researchers have been using language models for some time now. One thing that they have learned is that a language model can be useful even if the size of the n-grams considered only varies from 2 to 7. Language models can be constructed for any corpus of strings and can be used to distinguish one corpus from another, for example the works of Shakespeare from the works of James Joyce. In the last couple of years models have been constructed for program corpora. One of my RAs has now constructed some language models that detect malware.

Topics in this area:

Construct language models that classify malware binaries.
Use a compression algorithm to construct a language model. Compare this with a conventional language model. Is it more accurate? Is it smaller? Is it cheaper to construct?

3. NCD and metamorphism

Normalised Compression Distance (NCD) is an approximation to Normalised Information Distance. The latter is based on Kolmogorov Complexity, K, the algorithmic measure of the information in a (string) object. For a string x, K(x) is the length of the shortest program that can produce the string x. In general, K(x) is not decidable. In the early 2000’s Cilibrasi had the leap of faith that K(x) could be approximated by a compression program (or compressor), C, and that for practical purposes C(x), length of the compressed string, could be used for K(x).

For two strings x and y, NCD(x,y) = ( C(xy) - min{C(x),C(y)} )/ max{C(x),C(y)}

This has been applied to many kinds of strings but we are interested in applying it to binary executables.

Different compressors behave differently. We have discovered that the size of the compressor input window is a key configurable. In our recent experiments we have used 7zip with a 2GB window. The most interesting result we currently have is a statistically robust 97% accuracy in detecting malware via NCD values feeding into a machine learning classifier. As we have similar results using clustering with the k-medoids algorithm it is clear that it is NCD which is doing the heavy lifting.

Topics in this area could include:

Do similar experiments but on source code and on executable binaries rather than Java byte code. This requires finding metamorphic engines for other languages, e.g. C/C++.
Investigate how metamorphic engines are used in practice, especially by malware writers. Design experiments to reflect this.
Investigate whether repeated application of an engine to different programs makes the resulting programs converge with respect to similarity.

4. Other NCD experiments

There are lots of projects that could be done using NCD. You might have your own idea. I certainly have some more ideas.

One of these is a very “software engineering” project: using NCD to look at software version evolution. Here, we identify a corpus of open source software and use NCD and compressibility as metrics to learn about the pace and direction of development of new versions of the software. This could be security patches of Linux libraries, for example.
Another is to look at the correlation between NCD and size. NCD is “normalised” to allow strings of different length to be compared but it is still possible that size could be a good predictive factor for similarity. In fact, it would be useful to rule out size as a factor in compressibility as well.
Look at the use of dictionaries and other “partial compressions” to both speed up NCD and develop methods for small strings. Contemporary compressors are heading in this direction already … . .

5. Finding triggers for malware behaviours

Existing work in this area is limited and relies on symbolic execution. It would be interesting to investigate Monte Carlo methods and how to speed these up.

The student could focus on synthetic programs containing timing triggers:

write or collect some simple programs with branches whose entry is controlled by “clocks”
implement the method.
run experiments to compare the speed of the method with that of random input selection from a uniform distribution.

Applications go beyond malware behaviour triggers to finding vulnerabilities in code to a general search method for dynamic behaviours in software.

6. Metamorphic programming via gadgets

Return Oriented Programming uses a technique for gadget harvesting that has been adopted and adapted into a general metamorphic approach by Mohan and Hamlen. These new techniques have the potential to significantly tilt the malware arms race in favour of the malware writers in the future. In this project we will study what is known about Mohan and Hamlem¡¯s approach and try to replicate it in order to create benignware. The loose aim is to understand the technique and to speculate on detection methods.

7. Information Flow Security

The problem is how to engineer software so that the information flows within the software obey the required security policies. For example, a program analysis that shows that a web site that says it won’t keep your credit card once you use it, really can’t keep your credit card. These applications involve some combination of theory of security, program analysis, verification tools, and tool construction. Here are two example projects:

A: Covert channels in the cloud.

Covert channels are lines of communication which should not exist but, actually, do. Information can flow along these channels. For a while it has been assumed that, although it is possible for such channels to exist between different virtual machines on the same physical architecture, the bandwidth of such channels are low and so not worth worrying about. However in a 2012 paper an attack via the physical cache was descibed in which the bandwidth is significant. The task in this project is to, via programming, replicate the attack described in the paper and then use this knowledge to be creative about ways to mitigate this type of attack.

Paper: Whispers in the Hyper-Space: High-speed Covert Channel Attacks in the Cloud. Zhenyu Wu, Zhang Xu and Haining Wang. Usenix Security Symposium 2012.

B: A dynamic way to “statically” demonstrate flow security in software.

As discussed at length in the Language Based Security module some effort has to go into devising analyses of programs that show whether or not a given program has the noninterference property. Since these analyses use abstractions of the program semantics they can sometimes identify secure programs as insecure. But what if you could dynamically check for the noninterference property by just running the program once? and no abstraction meant you could be exact about which programs have the property and which do not? Well some one has shown how to do this in a paper published in 2012. It is called Multi-Program Execution (MPE). The task in this project is to devise a framework for MPE and program it, run tests on it, understand its weaknesses and, especially, try to understand how to extend it to nondeterministic programs, e.g. ones with embedded key generation. At present it only works for deterministic programs.

Paper: Noninterference through Secure Multi-execution. Dominique Devriese and Frank Piessens. Security and Privacy 2010

F: Experiments in Quantified Information Flow.

Recent research has developed a syntax directed analysis for a non-standard process language. The analysis produces a matrix of conditional probabilities for the way in which low security event sequences depend on high event sequences. Implementing this as a demonstration, proof of concept tool would require solving these problems:

implementing the Binary Differential Equations which extract the matrix from the syntax,
calculating QIF from the matrix
(optional) implementing one or more automatic translations from fragments of existing languages to the target non-standard one.

Paper: M. Boreale, D. Clark, and D. Gorla. A semiring-based trace semantics for processes with applications to information leakage analysis. In Proceedings of the 6th IFIP TC 1/WG 2.2 International Conference TCS 2010, Part of WCC2010 Proceedings. Springer, 2010.

G: Enhancing an existing QIF analyser.

Chunyan Mu has produced a proof of concept QIF analyser for a simple imperative language. This is used to automatically calculate the amount of information leaked to a low security user by running the program. The operation of the program could be enhanced in various ways. In particular the tool relies on the user creating a bespoke partition of the inputs in order to achieve speed ups in calculating leakage. A more useful interface would be to allow a user to select points in a lattice of possible partitions. How this is done could greatly influence the execution speed of the tool. There are other enhancement possibilities as well such as implementing translators into the target language. For a really ambitious student there is scope to research the analysis of additional program constructs.

Paper: Chunyan Mu and D. Clark. A quantitative analyser for programs in a core imperative language The Proceedings of the 8th International Conference on Quantitative Evaluation of Systems (QEST), 2011.

8. Security testing

At present there has been little research into a semantic approach to security testing. The place to start is white box testing as in this scenario the source code is available for analysis. Any formal semantics of a program will be based on how the program updates state.

Integrating security testing with test suite effectiveness.

Security testing often uses known types of attacks to develop a suite of similar attacks for a new program. Testing research suggests that the effectiveness of a test input is influenced by whether an infected execution state, one which is incorrect because of an error, affects the output from the test input. The task in this project is to understand the relationship between security tests and infected states so that current research into test suite effectiveness can be applied to security testing.

Paper: D. Clark and R. Hierons Squeeziness: A Information Theoretic Measure for Avoiding Fault Masking. Information Processing Letters, Volume 112, Issues 8 – 9, 30 April 2012,