Back in February, I rewrote the
compare()
function of my Perl script
duplicatecode.pl in C using
Inline::C. The original pure-Perl version ran in 8 hours, 38 minutes, and 27 seconds, measured with NYTProf.
[1] The C version runs in 5 minutes and 10.699 seconds, measured with the
time
command in
bash.
Times for one run each
Time for pure-Perl compare: 31107s (81765s)[2]
Time for C compare: real 5m10.699s user 4m23.250s sys 0m18.813s
Why is the pure-Perl version so slow? Because substr
has to do a bunch of checks on the type of variable it is accessing and validate arguments, whereas the C version only needs to do a dereference to get the character at a certain position in a string.
Here are the pure-Perl and C versions:
#pure-Perl
sub compare {
my ($line1, $line2) = @_;
#Both blank strings
if ($line1 eq '' && $line2 eq '') {
return 1.0;
}
#Prevent division by zero when dividing by shortest string later on
elsif ($line1 eq '' || $line2 eq '') {
return 0.0;
}
my $same = 0;
my $total = length($line1) + length($line2);
my $shortest = length($line1) < length($line2) ? length($line1) : length($line2);
my $longest = length($line1) > length($line2) ? length($line1) : length($line2);
my $i = 0;
while ($i < $shortest) {
if (substr($line1, $i, 1) eq substr($line2, $i, 1)) {
$same++;
}
else {
for my $j ($i + 1 .. ($shortest-1)) {
if ((substr($line1, $i, 1) eq substr($line2, $j, 1)) ||
(substr($line1, $j, 1) eq substr($line2, $i, 1))) {
$same++;
$i = $j;
last;
}
}
}
$i++;
}
return $same / $longest;
}
__END__
__C__
/* C version */
double compare(char* line1, char* line2) {
/* Both blank strings */
if (strcmp(line1, "") == 0 && strcmp(line2, "") == 0) {
return 1.0;
}
/* Prevent division by zero when dividing by shortest string later on */
else if (strcmp(line1, "") == 0 || strcmp(line2, "") == 0) {
return 0.0;
}
int same = 0;
int total = strlen(line1) + strlen(line2);
int shortest = strlen(line1) < strlen(line2) ? strlen(line1) : strlen(line2);
int longest = strlen(line1) > strlen(line2) ? strlen(line1) : strlen(line2);
int i = 0;
while (i < shortest) {
if (line1[i] == line2[i]) {
same++;
}
else {
for (int j = i + 1;j < shortest - 1;j++) {
if (line1[i] == line2[j] || line1[j] == line2[i]) {
same++;
i = j;
break;
}
}
}
i++;
}
return same/(longest + 0.0);
}
The code I used for profiling the pure-Perl version was:
~/src/glob2> perl -d:NYTProf ../duplicatecode/duplicatecode.pl --compareself libgag/src/*.cpp libusl/src/*.cpp libwee/src/*.cpp src/*.cpp
And for the Inline::C version:
~/src/glob2> time perl ../duplicatecode/duplicatecode.pl --compareself libgag/src/*.cpp libusl/src/*.cpp libwee/src/*.cpp src/*.cpp
This was measured on bash on openSUSE Leap 15.4 on WSL1 on Windows 10. Hardware used was an MSI GS63 Stealth with the second drive (D:) replaced with 2TB SSD.
The code I analyzed for duplicate code was Globulation 2 master-0.9.5.0-7c23bf87.
As you have seen in this post, rewriting a time-critical subroutine called from the inner loop in C with Inline::C can significantly speed up overall script execution time. However, it requires a C compiler to be available, so it is not completely portable. Neverless, optimizing slow code is good if it means you wait only minutes rather than hours for the results.
My perl version is:
> perl --version
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-thread-multi
Copyright 1987-2017, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
Footnotes:
1. Devel::NYTProf seems to be unable to measure the speed of Inline::C code.
2. According to Devel::NYTProf, which lists the time counted when profiling and total execution time in parentheses.
3. Yes, I know the C version won't work correctly with strings with embedded NUL bytes ('\0'
), but this script is intended to analyze source code rather than binary data.