Competitive Software Engineering
Competitive programming (CP) is a superb way to enter the world of computer science by solving challenging problems involving advanced data structures and algorithms. It can also prepare candidates for interviews that include similar, albeit less complex, problem-solving exercises. CP can help you become a software engineer ready to compete for excellence in the software industry. Nonetheless, there are many differences between CP and software engineering; some of them are mentioned here. This post tries to illuminate another crucial difference, one that also has a huge economic impact.
To make the point clear, let us use a concrete example from Petlja, a website maintained by the Serbian Mathematical Society, which also organizes official competitions in mathematics and programming. The problem we will use is called Dictionary. Here is its description translated into English:
Problem Description
Little Joe has decided to create his own dictionary composed of lexicographically ordered words. As he adds new words, the dictionary could become very large, and he needs your help answering, at any moment in time, how many words are less than some given word.
Input
The first line contains an integer N (1 ≤ N ≤ 100,000) that represents the number of commands. Each of the subsequent N lines is a command in one of the following forms:
ADD word
LESS word
The first command adds a new word to the dictionary, while the second queries how many stored words are less than the provided word.
Every word consists solely of letters of the English alphabet, but words should be treated in a case-insensitive fashion. Each word has a maximum length of 100, and the total number of letters in the dictionary will not exceed 2,000,000.
Output
For each query, output either an integer or the text 'no such word'. See an example on the problem page (look inside the section called Primer 1, i.e., Example 1).
Constraints
The memory limit is 64 MB and the time limit is 1 second. In 50% of the test cases, N ≤ 1,000.
Basic Version
A brute-force solution in C++ (the de facto standard CP language) is quite straightforward: store the words in a sorted set and use binary search to answer queries. Below is the full listing:
#include <algorithm>
#include <iostream>
#include <set>

using namespace std;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);
    cout.tie(nullptr);

    int n;
    cin >> n;
    set<string> dictionary;
    for (int i = 0; i < n; i++) {
        string command, word;
        cin >> command >> word;
        // Convert the word to lowercase, since words are case-insensitive.
        transform(word.cbegin(), word.cend(), word.begin(),
                  [](const unsigned char c) { return tolower(c); });
        if (command == "ADD")
            dictionary.insert(word);
        else {
            if (dictionary.find(word) == dictionary.cend())
                cout << "no such word";
            else
                cout << distance(dictionary.cbegin(), dictionary.lower_bound(word));
            cout << endl;
        }
    }
    return 0;
}
This version will produce a time limit exceeded (TLE) condition. Because a set's iterators are only bidirectional, distance must walk the tree element by element, so each LESS query costs O(N) and the simplified overall runtime complexity is O(N²); with N = 100,000 that is on the order of 10¹⁰ basic operations, far beyond what fits in one second. Note that a more accurate expression would differentiate between the number of ADD and the number of LESS commands.
Where to Go Next?
It all depends on the data. A golden rule of optimization says that we should always consider production data. This is where industry and CP diverge significantly, as depicted in the next figure.
Figure 1. - Relationships between various versions
Industry Optimized Variant
A real dictionary is composed of words approximately uniformly distributed by their first character, and queries would reference words sharing the same trait. Hence, a simple tactic is to reduce the size of the set we search by partitioning words according to their first letter, a classical data partitioning scheme. Although it retains the same asymptotic behavior, this version is considerably faster than the basic variant; big-O notation hides constant factors, and constant factors do matter in practice. Interestingly, at the time of this writing, it is even good enough to pass all test cases on Petlja, although in a more stringent CP setup it would fail with TLE (see the next section for more information). For the sake of completeness, here is the full listing:
#include <algorithm>
#include <iostream>
#include <set>
#include <unordered_map>

using namespace std;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);
    cout.tie(nullptr);

    int n;
    cin >> n;
    // Holds partitions of words based upon their first letter.
    unordered_map<unsigned char, set<string>> dictionary;
    for (int i = 0; i < n; i++) {
        string command, word;
        cin >> command >> word;
        // Convert the word to lowercase, since words are case-insensitive.
        transform(word.cbegin(), word.cend(), word.begin(),
                  [](const unsigned char c) { return tolower(c); });
        const unsigned char pivot = word.front();
        if (command == "ADD")
            dictionary[pivot].insert(word);
        else {
            if (dictionary[pivot].find(word) == dictionary[pivot].cend())
                cout << "no such word";
            else {
                // All words in partitions for smaller first letters are less.
                int less = 0;
                for (unsigned char c = 'a'; c < pivot; c++)
                    less += dictionary[c].size();
                cout << less + distance(dictionary[pivot].cbegin(),
                                        dictionary[pivot].lower_bound(word));
            }
            cout << endl;
        }
    }
    return 0;
}
CP Optimized Variant
In CP you are forced to come up with a qualitatively improved version with better asymptotic behavior. In our case, a "winning" adversarial test case would contain words all starting with the same letter; any data partitioning approach may be thwarted by such a "proper" test case. To combat skewed data sets we may apply various strategies, such as leveraging a policy-based data structure (PBDS), which lives in an undocumented corner of g++. Relying on non-standard features is considered bad practice in industry, but it is acceptable in the context of CP. Here is the new version with O(N · W · log N) runtime complexity, where W is the maximum word length (the factor W accounts for string comparisons):
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <algorithm>
#include <functional>
#include <iostream>

using namespace __gnu_pbds;
using namespace std;

// An ordered set with order-statistics support (order_of_key, find_by_order).
typedef tree<string, null_type, less<string>, rb_tree_tag,
             tree_order_statistics_node_update> OrderedSet;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);
    cout.tie(nullptr);

    int n;
    cin >> n;
    OrderedSet dictionary;
    for (int i = 0; i < n; i++) {
        string command, word;
        cin >> command >> word;
        // Convert the word to lowercase, since words are case-insensitive.
        transform(word.cbegin(), word.cend(), word.begin(),
                  [](const unsigned char c) { return tolower(c); });
        if (command == "ADD")
            dictionary.insert(word);
        else {
            // Number of stored words strictly less than `word`.
            const auto rank = dictionary.order_of_key(word);
            if (rank < dictionary.size() && *dictionary.find_by_order(rank) == word)
                cout << rank;
            else
                cout << "no such word";
            cout << endl;
        }
    }
    return 0;
}
Conclusion
Apparently, the industry and CP optimized versions are quite different, even though they follow the same logic. PBDS is not part of the C++ standard library and may not be available on all platforms; furthermore, it is considered advanced material. Any of these sorts of complexities would surely increase the maintenance cost of software. This is why approaching software development activities on large software systems in industry is fundamentally different from writing a CP solution.
Another key takeaway is that you must always tune performance based on production data and use cases. Tweaking software to cover every strange edge case makes it unnecessarily complex, error prone, and expensive. The experience and knowledge gained through CP are surely valuable, but software engineering is much more than solving problems to attain the best possible score.