This repository contains the source code for the paper "Estimating Simplet Counts via Sampling".
This work is an extended version of "Characterization of Simplicial Complexes by Counting Simplets Beyond Four Nodes".
- SC3 is an algorithm, based on the Color Coding method, for counting simplets of size 4, 5, and 6, where simplets are isomorphism classes of connected induced subcomplexes.
- SC3, which extends the Color Coding (CC) method from graphs to simplicial complexes, has the following properties:
  - accurate compared to baselines
  - fast and scalable
- In particular, the simplet counts produced by SC3 are highly effective at characterizing simplicial complexes (SCs) domain by domain.
- The original SC3 code, which supports size 4 and 5 simplets, can be found here.
- In this repository, we extend the simplet size to 6 and include SC3-E, a memory-efficient version of SC3.
- SCRW estimates simplet concentrations under the restricted access model on simplicial complexes.
- SCRW, which extends the Random Walk (RW) method from graphs to simplicial complexes, has the following properties:
  - speed and accuracy comparable to SC3
  - memory efficient
- Real-world datasets are available here. The format of the data is exactly as provided, and for SC3, only "{data}-nverts.txt" and "{data}-simplices.txt" are necessary.
- Each dataset used in our SCRW experiments is the largest connected component (LCC) of the original one, pre-processed to satisfy maximality (only maximal simplices are kept). The pre-processed data is available here.
- [Statistics of datasets we used]
- ‘-’ indicates that the statistic is the same as that of the full SC.

| dataset | # of vertices | # of maximal simplices |
|---|---|---|
| coauth-DBLP | 1,924,991 | 1,730,664 |
| -LCC | 1,654,109 | 1,563,050 |
| coauth-MAG-Geology | 1,256,385 | 925,027 |
| -LCC | 898,648 | 681,954 |
| coauth-MAG-History | 1,014,734 | 774,495 |
| -LCC | 219,435 | 130,579 |
| congress-bills | 1,718 | 48,898 |
| -LCC | - | 30,757 |
| contact-high-school | 327 | 4,862 |
| -LCC | - | - |
| contact-primary-school | 242 | 8,010 |
| -LCC | - | - |
| DAWN | 2,558 | 72,421 |
| -LCC | 2,290 | 72,153 |
| email-Eu | 979 | 8,038 |
| -LCC | - | - |
| email-Enron | 143 | 433 |
| -LCC | - | - |
| NDC-classes | 1,161 | 563 |
| -LCC | 628 | 324 |
| NDC-substances | 5,311 | 6,555 |
| -LCC | 3,065 | 4,533 |
| tags-ask-ubuntu | 3,029 | 95,639 |
| -LCC | 3,021 | 95,631 |
| tags-stack-overflow | 49,998 | 3,781,574 |
| -LCC | - | - |
| threads-ask-ubuntu | 125,602 | 149,025 |
| -LCC | 82,075 | 109,292 |
| threads-math-sx | 176,445 | 519,573 |
| -LCC | 152,702 | 496,379 |
| threads-stack-overflow | 2,675,955 | 8,694,667 |
| -LCC | 2,301,070 | 8,330,001 |
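
To make the data format and the LCC pre-processing concrete, here is a minimal Python sketch. It is not the code used in this repository. It assumes the usual layout of these files, in which "{data}-nverts.txt" gives the number of vertices of each simplex (one per line) and "{data}-simplices.txt" gives the vertices of all simplices consecutively (one per line). The function names are hypothetical, and measuring component size by number of vertices is an assumption; the actual pre-processing may differ in detail.

```python
from collections import defaultdict
from itertools import islice


def read_simplices(nverts_path, simplices_path):
    """Read simplices from a per-simplex size file and a flat vertex file."""
    with open(nverts_path) as f:
        sizes = [int(line) for line in f if line.strip()]
    with open(simplices_path) as f:
        vertices = (int(line) for line in f if line.strip())
        # Consume the flat vertex list in chunks of the recorded sizes.
        return [frozenset(islice(vertices, n)) for n in sizes]


def maximal_simplices(simplices):
    """Keep only simplices that are not strictly contained in another one."""
    # Process larger simplices first so that supersets are seen before subsets.
    unique = sorted({s for s in simplices if s}, key=len, reverse=True)
    containing = defaultdict(list)  # vertex -> indices into `kept`
    kept = []
    for s in unique:
        # A superset of s contains every vertex of s, so one vertex suffices.
        v = next(iter(s))
        if any(s < kept[i] for i in containing[v]):
            continue
        for u in s:
            containing[u].append(len(kept))
        kept.append(s)
    return kept


def largest_component(simplices):
    """Return the simplices of the connected component with the most vertices."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Vertices that share a simplex belong to the same component.
    for s in simplices:
        first, *rest = s
        root = find(first)
        for v in rest:
            parent[find(v)] = root

    size = defaultdict(int)
    for v in list(parent):
        size[find(v)] += 1
    best = max(size, key=size.get)
    return [s for s in simplices if find(next(iter(s))) == best]
```

For a dataset named "toy" (a placeholder), `largest_component(maximal_simplices(read_simplices("toy-nverts.txt", "toy-simplices.txt")))` would return the maximal simplices of its largest component under these assumptions.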
- (csv) timestamps for each step (directory: result\timestamp)
- (csv) count of every simplet of size k, one simplet per line in the format "{id of a simplet}, {counts}" (directory: result\CC for SC3, result\RW for SCRW)
- Choose an algorithm: SC3 (directory: src_CC) or SCRW (directory: src).
- Run the command "bash run.sh". Feel free to change the variables k, ss, datas, trials, and threads in the file "run.sh".
k="4 5 6" # list of simplet sizes
ss="100 1000 10000 100000" # list of sample counts
datas="toy" # list of dataset names
trials=5 # number of trials for running SC3/SCRW
threads=6 # number of threads
- For SC3, choose a version between SC3 and SC3-E (variable: ver in the file "run.sh").
ver=1 # original SC3
ver=2 # memory-efficient SC3: SC3-E
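
Once a run has finished, the count files described above can be post-processed with a few lines of Python. The sketch below is an illustration, not part of this repository. It assumes each file has no header and one "{id of a simplet}, {counts}" pair per line, treats ids as opaque strings, and takes the concentration of a simplet to be its count divided by the total count over all simplets of the same size. The function names and any file paths passed to them are hypothetical.

```python
import csv
from collections import defaultdict


def read_counts(path):
    """Parse lines of the form '{simplet id}, {count}' into a dict."""
    counts = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            counts[row[0].strip()] = float(row[1])
    return counts


def mean_counts(per_trial):
    """Average counts over trials; a simplet missing from a trial counts as 0."""
    sums = defaultdict(float)
    for counts in per_trial:
        for simplet_id, c in counts.items():
            sums[simplet_id] += c
    return {simplet_id: s / len(per_trial) for simplet_id, s in sums.items()}


def concentrations(counts):
    """Normalize counts so that they sum to 1 (empty dict if the total is 0)."""
    total = sum(counts.values())
    if not total:
        return {}
    return {simplet_id: c / total for simplet_id, c in counts.items()}
```

Averaging over trials before normalizing keeps the comparison between SC3 and SCRW on the same scale, since SCRW targets concentrations rather than raw counts.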