Tired of spending countless minutes downsampling your data? Look no further!
In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million minute-level records in just 41 seconds with DolphinDB.
The basic configuration of the DolphinDB server is:
16 CPU cores
256 GB memory
4 SSDs
A DolphinDB cluster with 4 data nodes is deployed, and each node uses a separate SSD.
[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*m7oACYfcwPFlhzMcTe8Tnw.png[/IMG]
The data we use is:
the level 1 quotes from the New York Stock Exchange for August 2007
around 272 GB in total, with 6.48 billion records
Downsampling can be performed with a SQL statement in DolphinDB.
[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*TnmOqOphYWrjO_tVwSXz4w.png[/IMG]
As the SQL query may involve multiple partitions, DolphinDB breaks down the job into several tasks and assigns the tasks to the corresponding data nodes for parallel execution. When all the tasks are completed, the system merges the intermediate results from the nodes to return the final result.
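The map-and-merge flow described above can be sketched in plain Python. This is only an illustration of the pattern, not DolphinDB’s actual implementation; the partition data and keys below are made up. Note that each partition returns (sum, count) pairs rather than averages, so the merged average stays exact:

```python
# Illustrative sketch of the map-then-merge pattern: each "partition"
# computes a partial aggregate; the coordinator merges the partials.

def map_task(partition):
    """Partial aggregate for one partition: (sum, count) per key."""
    partial = {}
    for key, bid in partition:
        s, c = partial.get(key, (0.0, 0))
        partial[key] = (s + bid, c + 1)
    return partial

def merge(partials):
    """Merge partial (sum, count) pairs, then compute final averages."""
    merged = {}
    for partial in partials:
        for key, (s, c) in partial.items():
            ms, mc = merged.get(key, (0.0, 0))
            merged[key] = (ms + s, mc + c)
    return {key: s / c for key, (s, c) in merged.items()}

# Two hypothetical date partitions of (symbol, bid) records.
p1 = [("EBAY", 30.0), ("EBAY", 32.0)]
p2 = [("EBAY", 34.0), ("IBM", 100.0)]
result = merge([map_task(p) for p in [p1, p2]])
print(result)  # {'EBAY': 32.0, 'IBM': 100.0}
```

Merging sums and counts instead of per-partition averages is what makes the parallel result identical to a single-machine aggregation.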
The script is as follows:
The frequency can be adjusted as needed simply by modifying bar(time, 60). Here, 60 means the data is downsampled to 1-minute intervals, as the timestamp values have second-level precision.
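As a rough illustration of this bucketing, bar(X, interval) can be thought of as truncating each value down to the nearest interval boundary. A minimal Python sketch, assuming timestamps are represented as seconds since midnight:

```python
def bar(x, interval):
    """Truncate x down to the nearest multiple of interval,
    mimicking the grouping behavior of DolphinDB's bar()."""
    return x - x % interval

# 09:30:05 and 09:30:59 (as seconds since midnight) land in the
# same 1-minute bucket; 09:31:00 starts the next one.
t1 = 9 * 3600 + 30 * 60 + 5    # 34205
t2 = 9 * 3600 + 30 * 60 + 59   # 34259
t3 = 9 * 3600 + 31 * 60        # 34260
print(bar(t1, 60), bar(t2, 60), bar(t3, 60))  # 34200 34200 34260
```

Changing the interval argument (e.g. 300 for 5-minute bars) changes the downsampling frequency in the same way.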
db = database("dfs://TAQ")
quotes = db.loadTable("quotes")

// count the source records for August 2007
select count(*) from quotes where date between 2007.08.01 : 2007.08.31

// take one row as a model table that defines the schema of the result
model = select top 1 symbol, date, minute(time) as minute, bid, ofr from quotes where date=2007.08.01, symbol=`EBAY

// recreate the result table, partitioned by date and symbol
if(existsTable("dfs://TAQ", "quotes_minute_sql"))
    db.dropTable("quotes_minute_sql")
db.createPartitionedTable(model, "quotes_minute_sql", `date`symbol)

// downsample to 1-minute bars and time the job
timer{
    minuteQuotes = select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol, date, bar(time, 60) as minute
    loadTable("dfs://TAQ", "quotes_minute_sql").append!(minuteQuotes)
}

// verify the number of downsampled records
select count(*) from loadTable("dfs://TAQ", "quotes_minute_sql")
[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*iJDnTB5cZ0prCUIzL0Ew4A.png[/IMG]
The table “quotes_minute_sql” is created with createPartitionedTable, and the downsampled result is then appended to it.
[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*2kM7NKookUH_LKWVYZ7k6Q.png[/IMG]
We can execute the script and visit the web-based user interface to check resource usage. It shows that all CPU cores participate in the downsampling; on each data node, 15 tasks run concurrently as data is read from disk.
[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*gEzm3dCur7RbFcDe19f4eg.png[/IMG]
When we come back to VS Code and check the execution status, we find that the downsampling completes in only 41 seconds, generating 61 million minute-level records.
[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*38Cr8hLn5BiKLT898pQyWQ.png[/IMG]
DolphinDB exhibits outstanding performance in data downsampling due to the following reasons:
Jobs are executed in a distributed manner, so the resources of multiple nodes are utilized at the same time;
Compression reduces disk I/O;
Columnar storage and vectorized computation improve the efficiency of aggregation.
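A toy Python sketch of the last point (an illustration only, not DolphinDB internals): with a column-oriented layout, an aggregation scans just the one contiguous column it needs instead of touching every record object.

```python
# Row-oriented: each record is a dict; averaging one field still
# touches every record object and every field stored alongside it.
rows = [{"symbol": "EBAY", "bid": 30.0 + i} for i in range(4)]
avg_row = sum(r["bid"] for r in rows) / len(rows)

# Column-oriented: each field is a contiguous array; the aggregation
# scans only the "bid" column, which also vectorizes well on real
# columnar engines.
cols = {"symbol": ["EBAY"] * 4, "bid": [30.0 + i for i in range(4)]}
avg_col = sum(cols["bid"]) / len(cols["bid"])

print(avg_row, avg_col)  # both 31.5
```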
To learn detailed operations of data downsampling, take a look at this demo!
Thanks for reading! To keep up with our latest news, please follow our Twitter and LinkedIn. You can also join our Slack to chat with the author!
Feel free to check our website for more information!