String Pattern Matching and Tools for Analyzing Code
Brenda S. Baker
Research Projects
- Software tools and algorithms for analyzing source code
I developed a theory of parameterized string pattern matching described in [2, 4,6-7,11] for
"parameterized strings" (p-strings) that can contain special parameter symbols.
If two such strings are a parameterized match (p-match), it is possible to transform one into the other via
systematically replacing parameters in one string by different parameters in the other, analagous to
replacing x
and foo
by y
and fun
everywhere throughout
a region of code.
- Dup
Given source code and a threshold length set by the user, Dup [5,6,8,11] finds all pairs of maximal regions
that are over the threshold length and are either identical textually or are p-matches.
Dup can also be used to find exact duplication
in text or genomic data. A scatter plot produced by Dup for a million-line production subsystem
is shown here.
Pairs of similar sections
of code are sometimes called "clones." An analysis of results from a joint experiment with
other software tools for finding clones is given in [1].
- Pdiff
Pdiff finds the smallest edit distance
between two source files based on edit operations of insertion, deletion, and
p-match. It generates HTML to display the two files side by side
with the differences marked. The algorithm is given in [4].
- Software tools for analyzing executables without access to source code
- Identifying related executables
Compiling related Java programs can result in very
different bytecode files, when these files are viewed
as raw bytes.
In order to identify bytecode files from related source files without access to the source,
several techniques were used
to adapt Dup (above), Udi Manber's tool Siff,
and the UNIX utility Diff to deal with bytecode files [5].
Our tools make it possible to find
similarities among thousands of bytecode files, or
to compare one new file to an index of many old ones.
Possible applications include detection of plagiarized code,
software reuse, program management, and uninstallers.
- Exediff - patch files for executables
For source files, updates are commonly distributed as patches.
When source code can't be given out
or when data transmission rates are slow, it would be convenient
to distribute updates to executables as patch files as well.
The difficulty is that
when primary changes are made to source code,
secondary changes (e.g. in pointers or jumps), can propagate throughout the executable.
Exediff
uses heuristics to reconstruct secondary changes where possible
in order to keep patch files small.
For the Alpha architecture, a comparison of the sizes of gzipped patch files created by Exediff and
by the
Bindiff
utility (which compares raw bytes) showed
that Exediff generally saved a factor of two for
major version changes and a factor of five for minor
version changes [4]. I also implemented a version of Exediff for Java bytecode files.
Selected Papers on String Pattern Matching and the above projects
-
Brenda S. Baker, Finding Clones with Dup: Analysis of an Experiment,
IEEE Trans. on Software Engineering 33,9, Sept. 2007, pp. 608-621. gzipped PostScript
-
Brenda S. Baker and Raffaele Giancarlo,
Sparse Dynamic Programming for Longest Common Subsequence from Fragments,
J. Algorithms 42,2, 2002, pp. 231-254. gzipped PostScript
-
Brenda S. Baker, Udi Manber, and Robert Muth, Compressing Differences of Executable
Code, in Proc. of the ACM SIGPLAN 1999 Workshop on
Compiler Support for System Software (WCSSS'99),
1999, pp 1-10. gzipped PostScript
-
Brenda S. Baker, Parameterized Diff,
Proc. of ACM-SIAM Symposium on Discrete Algorithms (SODA), Jan. 1999, pp. S854-S855. gzipped PostScript
-
Brenda S. Baker and Udi Manber, Deducing
Similarities in Java Sources from Bytecodes, in Proc. of the USENIX
Annual Technical Conference, 1998, pp. 179-190. gzipped PostScript
-
Brenda S. Baker and Raffaele Giancarlo.
Longest Common Subsequence from Fragments via Sparse Dynamic Programming,
Algorithms: 6th European Symposium Proceedings (ESA '98), Lecture Notes in Computer Science 1461, 1998, pp. 79-90.
-
Brenda S. Baker, Parameterized Duplication in Strings: Algorithms
and an Application to Software Maintenance,
SIAM J. on Computing 26,5, Oct. 1997, 1343-1362.
-
Brenda S. Baker, Parameterized String Pattern Matching.
JCSS 52,1, Feb. 1996, 28-42.
-
Brenda S. Baker, Parameterized Pattern Matching by Boyer-Moore-type
Algorithms,
Proc. of the Sixth ACM-SIAM Symposium on Discrete Algorithms (SODA), 1995, 541-550.
gzipped Postscript
-
Brenda S. Baker, On Finding Duplication and Near-Duplication in Large
Software Systems,
Proc. of the Second Working Conf. on Reverse Engineering,
1995, pp. 86-95. Received IEEE Outstanding Paper Award. gzipped PostScript
-
Brenda S. Baker, A Theory of Parameterized Pattern Matching:
Algorithms and Applications (Extended Abstract),
Proceedings of the 25th ACM Symposium on Theory of Computing (STOC '93),
1993, 71-80. gzipped PostScript
Other research
Home
Last modified: Sat Jan 22 21:16:12 PST 2011