SQL to KQL fun with the Mitre APT29 Day 1 dataset
By Gianni Castaldi
I’ve read a great blog post by Jose Rodriguez on how he converted SQL to Panda. I like the write-up, but it would be even better if we can use it in your favorite next-gen SIEM: Azure Sentinel. This is why I took his examples and converted the SQL statements to KQL operators and functions.
In this blog post I’ll write about the following statements:
SELECT
WHERE
AND/OR
IS NULL
GROUP BY
I’ve deployed the 2GB, APT29 dataset according to the blog post by Roberto Rodriguez.
Once the data has been imported we want to know how the data looks like. We can do this by retrieving a single line. We can do this by using one of the following operators.
- take
- The take operator does not use any other parameters than the number of rows you want to have. Therefore it is not guaranteed that the items appear on the order you would expect them.
- top
- The top operator does take parameters other than the number of rows you want to have. In my example we filter the table on TimeGenerated and sort it ascending so we see the first event.
prerecorded_CL
| where TimeGenerated > ago(7d)
| top 1 by TimeGenerated asc

Let’s start with the actual querying. The first and most common statement used in SQL is the SELECT statement. We use this to select columns from a table and then limit it to 5:
SELECT Hostname, Channel, EventTime, EventID
FROM apt29
LIMIT 5;
So if we want to convert this to KQL we need to make sure that we can filter on the items. This is because all the items are in the Message column.

There are several ways to query the data Message column but for this blog post, we’ll convert the Message which is a string to a dictionary, and then expand it so all items in the message get a column with the “msg_” prefix. After that, we will select the required columns with the project operator and then select the first 5 items with the top operator.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID
| top 5 by msg_EventTime asc

After selecting the data the next step would be to consume the data. We can do this by counting events, by channel, and hostname. But a simple way would be to get the length of the Hostnames.
In SQL we would select the columns and use the len function to get the length and store it as Hostname_Length:
SELECT Hostname, Channel, EventTime, EventID,
len(Hostname) as Hostname_Length
FROM apt29
LIMIT 5;
In KQL we will do this the other way around. We use the extend operator to create a new column and then add the value we want it to contain.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID
| top 5 by msg_EventTime asc
| extend Hostname_Length = strlen(msg_Hostname)

Bear in mind that this way of querying will take a lot of time because it first parses all the records and then selects the top 5. If you run KQL in production environments you first filter on TimeGenerated and then modify or consume your data.
Since we know how to consume data the next step will be to filter the data. In SQL we would use the WHERE statement followed by the column and the value:
SELECT Hostname, Channel, EventTime, EventID,
len(Hostname) as Hostname_Length
FROM apt29
WHERE Channel = 'Windows PowerShell'
LIMIT 5;
Now in KQL we will also use the where operator but we have several ways to get the results:
- == returns all the values but is case sensitive
- =~ returns all the values but is case insensitive
- has returns all the values but looks for whole terms
- contains returns all values but also returns subsequences
Do note that since contains string operator looks for subsequences it is a costly and long operation. That’s why I recommend to only use contains in very specific cases where you want to do some partial searches. In a future blog post, we will play a bit with string operators, but for now, you can read everything in the following docs page.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID
| extend Hostname_Length = strlen(msg_Hostname)
| where msg_Channel == "Windows PowerShell"
| top 5 by msg_EventTime asc

Ok, so we know that there are Powershell events, but what if we would like to which processes were started? This is where the AND / OR operators come in to play, we can query the Sysmon events and Event ID 1 for process creations.
SELECT Hostname, Channel, EventTime, EventID, CommandLine
len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1
LIMIT 5;
In this case we will change the query of the where operator to look for the channel: Microsoft-Windows-Sysmon/Operational and where the EventID is 1.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID, msg_CommandLine
| extend CommandLine_Length = strlen(msg_CommandLine)
| where msg_Channel == "Microsoft-Windows-Sysmon/Operational" and msg_EventID == "1"
| top 5 by msg_EventTime asc

In the next step we will query the Sysmon data and query for the registry events.
SELECT Channel, EventID, ProcessGuid, ProcessId, Image, TargetObject
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND (EventID = 12 OR eventID = 13)
LIMIT 5;
Besides the and operator, we can also use the or operator to look for Event ID 12 (create or delete) or 13 (value set). To achieve the same query results every time we modify the top 5 search to return the top 5 by Channel.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Channel, msg_EventID, msg_ProcessGuid, msg_ProcessId, msg_Image, msg_TargetObject
| where msg_Channel == "Microsoft-Windows-Sysmon/Operational" and (msg_EventID == "12" or msg_EventID == "13")
| top 5 by msg_Channel asc

The next step is in our hunt is to query all process creations with an empty command line. Empty command lines can appear when an attacker tries to obfuscate its actions. In SQL we would do this with the IS NULL statement
SELECT Hostname, Channel, EventTime, EventID, CommandLine
len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1 AND CommandLine IS NULL
LIMIT 5;
In KQL we have to options to search for items that are empty.
- isnull
- This operator returns where the value does not exist.
- isempty
- This operator also returns rows where the value is empty.
In this case we will use the is empty operator.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID, msg_CommandLine
| extend CommandLine_Length = strlen(msg_CommandLine)
| where msg_Channel == "Microsoft-Windows-Sysmon/Operational" and msg_EventID == "1" and isempty(msg_CommandLine)
| top 5 by msg_EventTime asc

In the next step we will query for all rows with a command line.
SELECT Hostname, Channel, EventTime, EventID, CommandLine
len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1 AND CommandLine IS NOT NULL
LIMIT 5;
We will use the isnotempty operator.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID, msg_CommandLine
| extend CommandLine_Length = strlen(msg_CommandLine)
| where msg_Channel == "Microsoft-Windows-Sysmon/Operational" and msg_EventID == "1" and isnotempty(msg_CommandLine)
| top 5 by msg_EventTime asc

Ok we are able to look for the data we want, the next logical step would be to group the data. In SQL we would use GROUP BY statement.
SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
len(CommandLine) as CommandLineLength, count(*)
FROM apt29
GROUP BY Hostname
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1;
In KQL we will use the Summarize operator and the count function.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID, msg_CommandLine , msg_Image
| extend CommandLine_Length = strlen(msg_CommandLine)
| where msg_Channel == "Microsoft-Windows-Sysmon/Operational" and msg_EventID == "1"
| summarize count() by msg_Hostname

To count occurrences can be great, but we also would like to see the first and last observed events. To do this we use the min() and max() functions.
prerecorded_CL
| extend MSG = todynamic(Message)
| evaluate bag_unpack(MSG, "msg_")
| project msg_Hostname, msg_Channel, msg_EventTime, msg_EventID, msg_CommandLine , msg_Image
| extend CommandLine_Length = strlen(msg_CommandLine)
| where msg_Channel == "Microsoft-Windows-Sysmon/Operational" and msg_EventID == "1"
| summarize FirstSeen = min(msg_EventTime), LastObserved = max(msg_EventTime), count() by msg_Hostname

So to wrap things up, we have imported a great dataset which is good to improve our hunting skills and test our queries. Next, we have converted some of the basic SQL statements to KQL operators and functions. And we have learned some of the functionality of KQL.
Thanks for reading and if you have any questions or ideas for a blog post let me know.