Instructors: Malachi Jones & Joe Mansour
Dates: June 23 to 26, 2025
Capacity: 25
This course enhances reverse engineering (RE) processes through automation, focusing on efficiency and
scalability in malware and firmware analysis by integrating Neural Networks (NN), Natural Language
Processing (NLP), and Large Language Models (LLMs). Participants will be introduced to Blackfyre, an
open-source platform developed for the course, featuring a Ghidra plugin for initial data extraction of
a binary into a Binary Context Container (BCC), and a Python library for advanced data parsing and
lifting to VEX Intermediate Representation (IR). These features of Blackfyre are pivotal in facilitating
the analysis and extraction of data and metadata, which are critical for effectively applying NN, NLP,
and LLM techniques in RE tasks. The course's primary objective is to impart foundational knowledge and
practical methodologies in RE automation, equipping learners with essential tools and skills. The emphasis
is on applying these foundations in learners' own independent projects rather than on delivering a
complete end-to-end automation solution.
As the course progresses, students explore Neural Networks (NN) and Natural Language Processing (NLP)
techniques for tackling core reverse engineering challenges. NLP is used to convert textual features—such
as strings, imports, and function names—into embeddings for downstream analysis by NN models.
Applications include binary classification, anomaly detection, and function similarity analysis. The
course introduces BinaryRank, a static analysis algorithm inspired by PageRank, which ranks basic blocks
using call graphs and control flow. Unlike PageRank, BinaryRank is computationally efficient, operating
in linear time. Combined with Contrastive Learning, this approach improves the precision of binary
representations. Concepts are reinforced through targeted, hands-on labs.
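To make the idea concrete, the sketch below implements plain PageRank-style score propagation over a toy call graph. This is not the course's BinaryRank algorithm (whose details are specific to the course material), and the call graph and function names are made up for illustration; it only shows the underlying intuition of importance flowing along call edges.

```python
# Illustrative PageRank-style ranking over a toy call graph.
# NOT the course's BinaryRank algorithm; a generic sketch of
# propagating importance along call edges via power iteration.

def rank_nodes(edges, damping=0.85, iters=50):
    """edges: dict mapping caller -> list of callees (all nodes are keys)."""
    nodes = set(edges) | {c for cs in edges.values() for c in cs}
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for caller, callees in edges.items():
            if callees:
                # split this caller's score evenly among its callees
                share = damping * score[caller] / len(callees)
                for callee in callees:
                    new[callee] += share
            else:
                # dangling node: spread its mass uniformly
                for v in nodes:
                    new[v] += damping * score[caller] / n
        score = new
    return score

# Hypothetical call graph of a small binary (names are invented).
call_graph = {
    "main": ["parse_config", "run"],
    "run": ["decrypt_payload", "parse_config"],
    "parse_config": [],
    "decrypt_payload": [],
}
ranks = rank_nodes(call_graph)
```

Frequently called helpers such as `parse_config` accumulate more rank than entry points that nothing calls, which is the kind of structural signal a ranking algorithm can feed into downstream embedding models.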
The advanced module introduces LLMs for condensing and interpreting reverse engineering artifacts
produced through static and dynamic analysis. Key tasks include function and binary summarization and
LLM-assisted malware analysis. Subsampling methods are used to address model token limits while
maintaining data fidelity. Labs in this section focus on summarization and signature generation
workflows to support downstream reverse engineering tasks.
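One simple way to subsample under a token budget is to keep the highest-value artifacts first. The sketch below is a hypothetical illustration, not the course's method: it greedily selects scored artifacts (e.g., decompiled functions) until an approximate token budget is exhausted, using a chars/4 heuristic in place of a real model tokenizer.

```python
# Hypothetical score-guided subsampling sketch: keep the highest-value
# artifacts that fit within an LLM token budget. Token counts use a
# chars/4 heuristic; a real pipeline would use the model's tokenizer.

def approx_tokens(text):
    return max(1, len(text) // 4)

def subsample(artifacts, budget):
    """artifacts: list of (score, text); returns the texts kept."""
    kept, used = [], 0
    for score, text in sorted(artifacts, key=lambda a: a[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept

# Invented example functions with invented importance scores.
funcs = [
    (0.9, "void decrypt_payload(...) { /* 400 chars */ }" + "x" * 360),
    (0.2, "int helper(...) { }" + "x" * 4000),
    (0.7, "void c2_beacon(...) { }" + "x" * 200),
]
selected = subsample(funcs, budget=250)
```

The two high-scoring functions fit the budget; the long, low-scoring helper is dropped, preserving the most informative context for the model.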
New in 2025, a dedicated fine-tuning lab has been added to provide hands-on experience training LLaMA
3.2 on real-world reverse engineering problems. Students will explore supervised fine-tuning techniques
using A6000 GPUs, incorporating LoRA and quantization strategies via Hugging Face Transformers and the
LLaMAFactory platform. Each participant is provisioned with an individual cloud GPU environment to
conduct fine-tuning experiments. This lab is designed to bridge research and application, giving
students direct exposure to the process of adapting open-source LLMs for security-relevant RE tasks.
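As a rough illustration of what such a supervised LoRA fine-tuning run can look like, below is a minimal YAML configuration in the style of LLaMA-Factory's SFT examples. It is a sketch only: the dataset name is hypothetical, and field names and values should be checked against the LLaMA-Factory documentation rather than taken as the course's exact setup.

```yaml
### model
model_name_or_path: meta-llama/Llama-3.2-3B-Instruct

### method
stage: sft                 # supervised fine-tuning
do_train: true
finetuning_type: lora
lora_rank: 8
quantization_bit: 4        # QLoRA-style 4-bit quantization

### dataset (name is hypothetical)
dataset: re_function_summaries
template: llama3
cutoff_len: 2048

### training
output_dir: saves/llama3.2-re-lora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```

Low-rank adapters plus 4-bit quantization keep memory use within reach of a single A6000-class GPU, which is why this combination is a common starting point for adapting open-source LLMs.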
This course is designed for reverse engineering practitioners with a strong foundation in reverse
engineering concepts and proficiency in Python object-oriented programming for hands-on labs.
Participants should also have a basic understanding of mathematical concepts (e.g., vectors, weighted
averages, and Euclidean distance) and foundational Machine Learning (ML) knowledge, including supervised
learning, feature extraction, and evaluation metrics (e.g., precision and recall). This background is
essential for engaging with advanced ML topics such as transformers and fine-tuning Large Language
Models (LLMs). By meeting these prerequisites, participants will gain practical skills for applying advanced
NN/NLP/LLM techniques—including fine-tuning—to RE automation.
A strong foundation in reverse engineering—including assembly languages, calling conventions, file
formats, and Control Flow Graphs (CFGs)—is essential for this course. Participants should have at least
two years of hands-on experience and proficiency in Python object-oriented programming (e.g., classes,
inheritance, and polymorphism) for lab work.
The course explores advanced Machine Learning (ML), Natural Language Processing (NLP), and Large
Language Models (LLMs) for malware and firmware analysis. Students should be familiar with key ML
concepts such as supervised and unsupervised learning, feature extraction, data preprocessing,
overfitting, underfitting, and evaluation metrics like precision, recall, and F1-score, as these are
foundational to advanced topics like transformers and LLM fine-tuning.
Basic knowledge of mathematical concepts such as vectors, weighted averages, probabilities, gradient
descent, and distance metrics (e.g., Euclidean distance) is also critical for applying ML techniques to
automate reverse engineering tasks.
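These mathematical prerequisites show up directly in embedding-based similarity work. The toy example below (with made-up embedding values) shows Euclidean distance and a weighted average applied to small vectors, the same operations used when comparing function embeddings.

```python
# Toy illustration of the math prerequisites: Euclidean distance and
# a weighted average over small embedding vectors (values invented).
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_average(vectors, weights):
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

f1 = [0.9, 0.1, 0.4]   # embedding of function A
f2 = [0.8, 0.2, 0.5]   # embedding of a near-duplicate of A
f3 = [0.1, 0.9, 0.0]   # embedding of an unrelated function
```

A near-duplicate function sits closer to the original in embedding space than an unrelated one, which is the basic premise behind function similarity analysis.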
This course is ideal for reverse engineering practitioners aiming to automate reverse engineering processes, chiefly in malware and firmware analysis, using Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs). It blends theoretical insight with practical skills for tackling modern challenges in malware and firmware analysis, and it lays the groundwork for independent exploration (outside of the course) of adjacent areas such as vulnerability analysis and software bill of materials (SBOM) generation. With a structured curriculum, the course serves as a valuable stepping stone for those looking to advance their expertise in automating RE tasks within the cybersecurity domain.
This course may not suit individuals who are not proficient in Python object-oriented programming, which is vital for navigating the course labs. Those uncomfortable with key mathematical concepts such as vectors, weighted averages, and distance metrics (e.g., Euclidean distance) may find the technical depth challenging. Although the course explores Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs), it is principally designed for automating specific reverse engineering processes in malware and firmware analysis; those seeking a broader or more generalized education in ML, NLP, or LLMs without a reverse engineering focus may find that the course does not align with their learning objectives.
Students should ensure they have a laptop with a minimum of 32 GB RAM, 250 GB of free disk space, and a processor with at least 4 cores, equivalent to an Intel i7 or higher. The processor must be an x86_64 architecture to ensure compatibility with the course-provided virtual machine (VM) and to run VirtualBox version 7.1 or later. Additionally, the processor must support AVX (Advanced Vector Extensions), which are required for running machine learning frameworks such as TensorFlow and PyTorch. Connectivity capabilities are also essential for accessing external services used in the Large Language Models (LLMs) components of the course. VirtualBox should be pre-installed to enable participation in the hands-on labs and exercises.
Dr. Malachi Jones is a Principal Cyber Security AI/LLM Researcher at Microsoft, working within the Microsoft Security AI (MSECAI) team. His current work focuses on fine-tuning Large Language Models (LLMs) for security applications, optimizing their performance for tasks such as reverse engineering (RE) and malware analysis. Dr. Jones leads the development of reverse engineering capabilities for Security Copilot, integrating advanced static and dynamic analysis techniques, including sandbox detonation, with large language models (LLMs) to enhance malware detection and analysis workflows.
With over 15 years of experience in security research, Dr. Jones has made significant contributions to both academic and industrial sectors. Prior to his work at Microsoft, he held a role at MITRE, where he applied machine learning (ML) and intermediate representation (IR) languages to automate reverse engineering tasks. During his tenure, he also developed and taught a course titled "Automating Reverse Engineering with Machine Learning and Binary Analysis." At Booz Allen Dark Labs, Dr. Jones specialized in embedded security research and developed LLVM IR-based tools for automated vulnerability assessment, co-authoring US Patent 10,133,871.
In addition to his work at Microsoft, Dr. Jones is the founder of Jones Cyber-AI, a company dedicated to his independent research and teaching initiatives. Through Jones Cyber-AI, he has taught his specialized course, "Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs," at leading conferences such as Black Hat USA (2019, 2021, 2023, and 2024) and RECON Montreal (2023 and 2024). His independent research into AI, ML, NLP, and LLMs ensures his courses remain cutting-edge and relevant to the latest advancements in cybersecurity.
Dr. Jones has also demonstrated a commitment to education, having served as an Adjunct Professor at the University of Maryland, College Park, from 2019 to 2020, where he taught "Machine Learning Techniques Applied to Cybersecurity."
Dr. Jones holds a B.S. in Computer Engineering from the University of Florida and both an M.S. and Ph.D. from Georgia Tech, where his research focused on applying game theory to cybersecurity challenges. His expertise in AI, ML, LLMs, and cybersecurity continues to drive innovations in reverse engineering and broader cybersecurity solutions.
Joe Mansour is a Security Researcher at Microsoft. With a focus on reverse engineering malware, he
develops detections to protect customers. His expertise is rooted in a background that spans red
teaming, vulnerability assessment, and hardware hacking. Joe has contributed to projects involving
automated reverse engineering, showcasing his aptitude for binary analysis and tool development to
simplify the complexities of reverse engineering.
He holds an M.S. in Computer Science from Johns Hopkins University and a B.S. from the University of
Illinois at Urbana-Champaign.