Summary
In the previous post, we described the first steps for installing and building ArKanjo. Today, we continued working on the project by forking the repository, fixing an installation/build issue, and starting to explore how the codebase could be extended with an embedding-based duplicate detection method.
The main practical result of the day was a small installation fix related to pthread linkage. After that, we started experimenting with a transformer encoder model for code embeddings. The long-term goal is to implement a flexible ArKanjo method that computes embeddings for functions and uses cosine similarity to identify possible duplicates.
Forking the repository and fixing the installation bug
We started by working on our fork of the ArKanjo repository. Since the previous post, ArKanjo had been updated with support for a new method, and that update introduced build errors on our machine. To fix them, we worked in a branch called fix-install.
While rebuilding ArKanjo, the compilation step succeeded, but the final link step failed with undefined references to pthread_create:
/usr/bin/ld: libcore_methods.a(ast_method.cpp.o): undefined reference to "pthread_create"
collect2: error: ld returned 1 exit status
This was not a CMake version problem. The error happened at link time because the code in ast_method.cpp uses pthreads, but the final executable was not being linked against the POSIX threads library.
The fix was to use CMake's standard Threads package, which exposes the imported target Threads::Threads, instead of manually adding linker flags. In the top-level CMakeLists.txt, we added:
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
Then, in cmake/Targets.cmake, we updated the core_methods target to link against Threads::Threads:
target_link_libraries(core_methods
    PUBLIC
        core_base
        ${PARSER_LIBS}
        tree_sitter_core
        Threads::Threads
)
This solved the issue.
Plan for a new embedding-based method
After fixing the build issue, we started exploring the codebase with the goal of adding a new duplicate detection method based on language-model embeddings.
The idea is to implement a flexible method that receives source code snippets, computes vector embeddings with a transformer encoder, and compares functions using cosine similarity. The first model we are experimenting with is jinaai/jina-embeddings-v2-base-code, but the final implementation should not hard-code this model. Instead, the embedding model should be configurable by the user.
Conceptually, the method should follow a simple pipeline:
function source code
↓
transformer encoder
↓
embedding vector
↓
cosine similarity
↓
duplicate / not duplicate decision
This is different from purely syntactic duplicate detection. Instead of comparing only tokens, trees, or handcrafted features, the method uses a pretrained model to place pieces of code in a semantic vector space. The hope is that functions with similar behavior will have nearby embeddings, even when variable names or surface syntax differ.
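The pipeline above can be sketched in a few lines of Python. This is only an illustration, not ArKanjo code: toy_encoder is a hypothetical stand-in that counts identifiers and punctuation instead of running a transformer, and is_duplicate and its default threshold of 0.7 are assumptions made for the example. The point is that the encoder is passed in as a parameter, so it can be swapped without touching the comparison logic.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

# In this sketch an "embedding" is a sparse vector: token -> weight.
Vector = Dict[str, float]


def toy_encoder(source: str) -> Vector:
    """Hypothetical stand-in for a transformer encoder.

    It only counts identifiers and punctuation, so it captures surface
    tokens, not semantics. A real encoder would return a dense vector.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|[^\s\w]", source)
    return dict(Counter(tokens))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(
    code_a: str,
    code_b: str,
    encoder: Callable[[str], Vector],
    threshold: float = 0.7,  # assumed value, not a tuned setting
) -> bool:
    """source code -> encoder -> embedding -> cosine similarity -> decision."""
    return cosine_similarity(encoder(code_a), encoder(code_b)) >= threshold
```

Because the toy encoder sees only tokens, it cannot tell similar-looking functions with different behavior apart; that gap is what a pretrained encoder is expected to narrow.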
Setting up a CPU environment
Before integrating anything into ArKanjo, we created a small isolated environment to test the embedding model on CPU:
conda create -n jina-cpu python=3.10 -y
conda activate jina-cpu
pip install --upgrade pip
We installed the CPU version of PyTorch:
pip install torch --index-url https://download.pytorch.org/whl/cpu
Then we installed the Hugging Face dependencies:
pip install transformers sentence-transformers accelerate
The first attempt installed a newer transformers version that was incompatible with the remote model code used by the Jina model. The error was:
ImportError: cannot import name 'find_pruneable_heads_and_indices'
from 'transformers.pytorch_utils'
We fixed this by uninstalling the packages and installing a transformers version below 5:
pip uninstall -y transformers sentence-transformers
pip install "transformers<5" "sentence-transformers>=2.3.0" accelerate
After that, the model loaded correctly on CPU and we were able to run small embedding experiments.
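To catch this kind of mismatch before loading a model, a small guard can check the installed transformers version up front. The sketch below is an assumption about how such a check might look; major_version, is_supported_transformers, and check_environment are hypothetical helper names, and the only rule encoded is the "below 5" constraint described above.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '4.41.2'."""
    return int(version_string.split(".")[0])


def is_supported_transformers(version_string: str, max_major: int = 5) -> bool:
    """True when the major version is strictly below max_major."""
    return major_version(version_string) < max_major


def check_environment() -> str:
    """Describe whether the installed transformers satisfies the constraint."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if is_supported_transformers(installed):
        return f"transformers {installed} satisfies the <5 constraint"
    return f"transformers {installed} is 5 or newer; reinstall with 'transformers<5'"


if __name__ == "__main__":
    print(check_environment())
```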
First embedding experiment
To get an initial feel for the model, we wrote a small script that computes embeddings for pairs of simple code snippets, compares them with cosine similarity, and evaluates a threshold-based duplicate classifier.
from transformers import AutoModel
from numpy.linalg import norm


def cos_sim(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return (a @ b.T) / (norm(a) * norm(b))


def compute_accuracy(model, examples, threshold):
    correct = 0
    print(f"Using threshold = {threshold}\n")
    for i, (code_a, code_b, is_duplicate) in enumerate(examples, start=1):
        # encode() is provided by the model's remote code
        # (hence trust_remote_code=True when loading).
        embeddings = model.encode([code_a, code_b])
        score = cos_sim(embeddings[0], embeddings[1])
        pred = score >= threshold
        is_correct = pred == is_duplicate
        correct += int(is_correct)
        print(f"Example {i:02d}")
        print(f"Cosine score: {score:.4f}")
        print(f"Prediction: {'duplicate' if pred else 'not duplicate'}")
        print(f"Gold label: {'duplicate' if is_duplicate else 'not duplicate'}")
        print(f"Result: {'OK' if is_correct else 'WRONG'}")
        print("-" * 60)
    accuracy = correct / len(examples)
    print(f"\nFinal accuracy @ threshold={threshold}: {accuracy:.2%}")
    print(f"Correct: {correct}/{len(examples)}")
    return accuracy


def main():
    model_name = "jinaai/jina-embeddings-v2-base-code"
    threshold = 0.7
    print(f"Loading model: {model_name}")
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,
    )
    # Each example is (code_a, code_b, is_duplicate).
    examples = [
        # Duplicate: same code
        (
            "def add(a, b):\n"
            "    return a + b",
            "def add(a, b):\n"
            "    return a + b",
            True,
        ),
        # Duplicate: same behavior, different names
        (
            "def add(a, b):\n"
            "    return a + b",
            "def sum_two(x, y):\n"
            "    return x + y",
            True,
        ),
        # Not duplicate: similar structure, different operation
        (
            "def add(a, b):\n"
            "    return a + b",
            "def multiply(a, b):\n"
            "    return a * b",
            False,
        ),
        # Duplicate: even check, different style
        (
            "def is_even(x):\n"
            "    return x % 2 == 0",
            "def check_even(n):\n"
            "    if n % 2 == 0:\n"
            "        return True\n"
            "    return False",
            True,
        ),
        # Not duplicate: opposite behavior
        (
            "def is_even(x):\n"
            "    return x % 2 == 0",
            "def is_odd(x):\n"
            "    return x % 2 != 0",
            False,
        ),
        # Duplicate: same loop behavior
        (
            "def print_items(xs):\n"
            "    for x in xs:\n"
            "        print(x)",
            "def show_values(values):\n"
            "    for value in values:\n"
            "        print(value)",
            True,
        ),
        # Not duplicate: both use a list, but different behavior
        (
            "def print_items(xs):\n"
            "    for x in xs:\n"
            "        print(x)",
            "def count_items(xs):\n"
            "    return len(xs)",
            False,
        ),
        # Duplicate: same max behavior
        (
            "def get_max(xs):\n"
            "    return max(xs)",
            "def largest(values):\n"
            "    return max(values)",
            True,
        ),
        # Not duplicate: max vs min
        (
            "def get_max(xs):\n"
            "    return max(xs)",
            "def get_min(xs):\n"
            "    return min(xs)",
            False,
        ),
        # Duplicate: same sum behavior
        (
            "def sum_list(xs):\n"
            "    total = 0\n"
            "    for x in xs:\n"
            "        total += x\n"
            "    return total",
            "def add_all(values):\n"
            "    result = 0\n"
            "    for value in values:\n"
            "        result = result + value\n"
            "    return result",
            True,
        ),
        # Not duplicate: sum vs product
        (
            "def sum_list(xs):\n"
            "    total = 0\n"
            "    for x in xs:\n"
            "        total += x\n"
            "    return total",
            "def product_list(xs):\n"
            "    result = 1\n"
            "    for x in xs:\n"
            "        result *= x\n"
            "    return result",
            False,
        ),
    ]
    compute_accuracy(model, examples, threshold)


if __name__ == "__main__":
    main()
This script is intentionally simple. It is not meant to be a serious benchmark yet. The goal is only to check whether the environment is working and the model can produce reasonable similarities for very small duplicate and non-duplicate examples.
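Since the script evaluates a single fixed threshold, a natural follow-up is to see how accuracy changes as the threshold moves. The sketch below assumes the cosine scores have already been computed and stored next to their gold labels; accuracy_at, sweep, and best_threshold are hypothetical helpers, and the scores in the demo are placeholder values invented for illustration, not measurements from the model.

```python
from typing import List, Sequence, Tuple

# One evaluated pair: (cosine score, gold label).
ScoredPair = Tuple[float, bool]


def accuracy_at(pairs: Sequence[ScoredPair], threshold: float) -> float:
    """Fraction of pairs whose thresholded prediction matches the gold label."""
    correct = sum((score >= threshold) == label for score, label in pairs)
    return correct / len(pairs)


def sweep(
    pairs: Sequence[ScoredPair], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (threshold, accuracy) for every candidate threshold."""
    return [(t, accuracy_at(pairs, t)) for t in thresholds]


def best_threshold(pairs: Sequence[ScoredPair], thresholds: Sequence[float]) -> float:
    """Pick the threshold with the highest accuracy.

    max() keeps the first maximum, so ties go to the earliest listed threshold.
    """
    return max(thresholds, key=lambda t: accuracy_at(pairs, t))


if __name__ == "__main__":
    # PLACEHOLDER scores, invented for this example only.
    placeholder = [(0.95, True), (0.88, True), (0.82, False), (0.60, False)]
    for t, acc in sweep(placeholder, [0.5, 0.7, 0.85, 0.9]):
        print(f"threshold={t:.2f} accuracy={acc:.2%}")
```

On a set as small as the one in the script, a threshold chosen this way would be fitted to the same examples it is evaluated on, so it should be read as a sanity check rather than a tuned value.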
Next steps
The next step is to understand where this method should fit inside ArKanjo's architecture. The current plan is to implement a new flexible method that can call an external or configurable embedding model, compute function-level embeddings, and compare candidate pairs using cosine similarity.
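One way the pair-comparison step could look, once function-level embeddings exist, is sketched below. It assumes embeddings arrive as a mapping from a function identifier to a dense vector; normalize and similar_pairs are hypothetical names, and nothing here reflects ArKanjo's actual interfaces. Normalizing each vector once means the cosine similarity of any pair reduces to a plain dot product.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (all zeros stay all zeros)."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0.0:
        return [0.0] * len(vec)
    return [x / length for x in vec]


def similar_pairs(
    embeddings: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (name_a, name_b, score) for every pair scoring at or above threshold.

    Compares all pairs, so the cost grows quadratically with the number
    of functions.
    """
    unit = {name: normalize(vec) for name, vec in embeddings.items()}
    results = []
    for (name_a, a), (name_b, b) in combinations(unit.items(), 2):
        # For unit vectors the dot product equals the cosine similarity.
        score = sum(x * y for x, y in zip(a, b))
        if score >= threshold:
            results.append((name_a, name_b, score))
    # Highest-scoring candidates first.
    return sorted(results, key=lambda item: item[2], reverse=True)
```

For a large codebase the all-pairs loop would likely need a cheaper candidate filter or an approximate nearest-neighbor index, which is outside the scope of this sketch.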
The method should expose at least two user-facing choices: the embedding model name and the similarity threshold. Starting with jinaai/jina-embeddings-v2-base-code is useful for prototyping, but the implementation should allow other encoder models to be used without changing the C++ method logic.
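As a rough sketch of those two user-facing choices, a helper script on the Python side could accept them as command-line options. The option names --model and --threshold, the defaults, and the parse_args helper are assumptions for illustration; how ArKanjo would actually pass these settings is still open.

```python
import argparse
from typing import List, Optional

# Assumed defaults, taken from the prototype script.
DEFAULT_MODEL = "jinaai/jina-embeddings-v2-base-code"
DEFAULT_THRESHOLD = 0.7


def threshold_type(text: str) -> float:
    """Parse a threshold and reject values outside the cosine range [-1, 1]."""
    value = float(text)
    if not -1.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(
            f"threshold must be between -1 and 1, got {value}"
        )
    return value


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two user-facing settings: model name and similarity threshold."""
    parser = argparse.ArgumentParser(
        description="Embedding-based duplicate detection (sketch)"
    )
    parser.add_argument(
        "--model",
        default=DEFAULT_MODEL,
        help="name of the embedding model to load",
    )
    parser.add_argument(
        "--threshold",
        type=threshold_type,
        default=DEFAULT_THRESHOLD,
        help="cosine similarity at or above which a pair counts as duplicate",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Explicit argument list so the demo does not depend on the real command line.
    args = parse_args(["--threshold", "0.8"])
    print(f"model={args.model} threshold={args.threshold}")
```

Keeping the model name as a plain string option means a different encoder can be tried by changing one argument, with no change to the comparison code.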
More broadly, this contribution explores whether modern pretrained code embeddings can complement ArKanjo's existing duplicate detection strategies. The goal is not to replace the current methods immediately, but to add an experimental path for semantic duplicate detection and then evaluate how useful it is on realistic code examples.