
Explanations of Parameters

jordrow edited this page Apr 15, 2020 · 12 revisions

SIP was designed to run quickly and to provide users with complete control so that optimization of loops is feasible. To optimize parameters we suggest running SIP on a single chromosome (only include one chromosome in the size file) and checking the results visually in Juicebox or in HiGlass. This guide is meant to provide information for the parameters during the optimization process.
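As a minimal sketch of the single-chromosome setup suggested above, the following Python helper keeps only one chromosome from a size file. It assumes the size file is tab-delimited with the chromosome name in the first column and its length in the second. The function name and the example lengths are illustrative placeholders, not part of SIP.

```python
def keep_single_chromosome(size_lines, chromosome):
    """Return only the size-file lines whose first column matches `chromosome`.

    Assumes a tab-delimited "name<TAB>length" layout (an assumption about the
    size file, not something this page specifies).
    """
    kept = []
    for line in size_lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0] == chromosome:
            kept.append(line)
    return kept


# Hypothetical size-file content; the lengths are placeholders.
example_sizes = ["chr1\t1000000\n", "chr2\t800000\n", "chr3\t600000\n"]
single = keep_single_chromosome(example_sizes, "chr2")
print("".join(single), end="")
```

Writing the returned lines to a new file gives a size file that restricts a test run to that one chromosome.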

cool, hic, or processed method:

The program has three input options: cool, hic, or processed. With the cool option, your input should be a .mcool file. With the hic option, your input should be a Juicer-derived .hic file. Advice: run SIP on the q30 .hic file.

For example: https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined_30.hic

Note: A web address to a .hic file can be used, but we recommend downloading the .mcool or .hic file first to run SIP faster.

Use the processed option if you have already run SIP and are optimizing parameters. This avoids the time it takes to dump the observed values from the .hic or .mcool file. Alternatively, users can create their own processed files.

The processed files are tab-delimited with four columns and no header. The column names below are shown for reference only and are not written to the file:

chr1      chr2      value        distance normalized value

560000    565000    1499.2633    5.84271154556146
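The four-column layout can be read and written with a few lines of Python. This is only a sketch: the field names `pos1`, `pos2`, `value`, and `distance_normalized` are illustrative labels for the four columns, since the files themselves carry no header.

```python
from typing import NamedTuple


class ProcessedRow(NamedTuple):
    # Field names are illustrative; the processed files have no header.
    pos1: int
    pos2: int
    value: float
    distance_normalized: float


def parse_processed_line(line: str) -> ProcessedRow:
    """Parse one whitespace/tab-separated line of a processed file."""
    pos1, pos2, value, normalized = line.split()
    return ProcessedRow(int(pos1), int(pos2), float(value), float(normalized))


def format_processed_line(row: ProcessedRow) -> str:
    """Write a row back as a tab-separated line without a header."""
    return "\t".join(str(field) for field in row)


row = parse_processed_line("560000\t565000\t1499.2633\t5.84271154556146")
print(row)
```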

If generating your own processed files, divide each chromosome into small chunks (10 Mb) and place all the chunks of a chromosome in a directory named after that chromosome. Each chromosome should have its own directory containing one file per chunk. The file name recognized by SIP requires the chromosome name, the start coordinate of the chunk, and the end coordinate of the chunk (e.g. 1_5000000_14999999.txt). The program will run for each chromosome specified in the chromosome size file. If you want to run SIP on a single chromosome, include only that chromosome in your size file.
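The naming scheme can be sketched as a small helper. It assumes the end coordinate is inclusive (start + chunk size - 1), which is inferred from the example 1_5000000_14999999.txt. The root directory name `processed` and the function name are hypothetical.

```python
from pathlib import PurePosixPath


def chunk_path(root: str, chromosome: str, start: int, chunk_size: int) -> str:
    """Build <root>/<chromosome>/<chromosome>_<start>_<end>.txt for one chunk."""
    # Inclusive end coordinate, inferred from the example file name on this page.
    end = start + chunk_size - 1
    file_name = f"{chromosome}_{start}_{end}.txt"
    return str(PurePosixPath(root) / chromosome / file_name)


# "processed" is a placeholder for your own output directory.
example_path = chunk_path("processed", "1", 5_000_000, 10_000_000)
print(example_path)
```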

SIP was meant to run quickly with user control of many parameters. We therefore recommend running it several times on a single chromosome, changing the parameters each time until they best fit the data. Pay particular attention to the -g and -fdr options.
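One way to organize such repeated test runs is to enumerate the option combinations up front. The sketch below only builds the option portion of each run from flags named on this page. The rest of the SIP invocation is not covered here and is deliberately left out. Apart from the -g default of 1.5, the swept values are hypothetical starting points, not recommendations.

```python
from itertools import product


def option_grid(base, sweep):
    """Yield one list of flags per combination of the swept parameter values."""
    names = list(sweep)
    for values in product(*(sweep[name] for name in names)):
        options = dict(base)
        options.update(zip(names, values))
        flags = []
        for name, value in options.items():
            flags += [f"-{name}", str(value)]
        yield flags


# Fixed options for every run, plus the two parameters being tuned.
base = {"res": 5000, "mat": 2000}
sweep = {"g": [1.0, 1.5, 2.0], "fdr": [0.01, 0.05]}  # hypothetical values

grid = list(option_grid(base, sweep))
for flags in grid:
    print(" ".join(flags))
```

Each printed line can be appended to your usual SIP command for the single test chromosome, and the resulting loop lists compared visually.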

-res

Choose the resolution that best suits the data. We recommend 5 kb for high-resolution datasets and 10 kb for others. You can also call loops at 5, 10, and 25 kb by setting this to 5000 and then setting -factor 4. The resolution chosen must be present in the .hic file. If your matrix is sparse, choosing a larger bin size here may provide better results.

-mat

This determines the size of each chunk of the matrix. For example -mat 2000 at 5 kb resolution will process 10 Mb at a time. To process 10 Mb chunks at 10 kb resolution, set -mat 1000. Therefore, if you alter -res you may want to consider altering -mat accordingly. Changing this will alter the sizes of the matrices stored and can affect the computation time and memory.
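The relationship between -mat, -res, and the genomic span of a chunk is a simple product, sketched below. The function names are illustrative.

```python
def chunk_span_bp(mat: int, res: int) -> int:
    """Genomic span (bp) covered by one chunk of `mat` bins at `res` bp per bin."""
    return mat * res


def mat_for_span(span_bp: int, res: int) -> int:
    """-mat value that keeps a chunk at `span_bp` when the resolution changes."""
    if span_bp % res != 0:
        raise ValueError("span must be a whole number of bins")
    return span_bp // res


print(chunk_span_bp(2000, 5000))
print(mat_for_span(10_000_000, 10_000))
```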

-d

The number of bins around the diagonal to ignore, which determines the minimum size of a loop that can be called. This should be adjusted empirically based on the noise at the diagonal in each dataset. For example, -d 6 will remove 6 pixels at the diagonal. This removes 30 kb at 5 kb resolution, but 60 kb at 10 kb resolution. We encourage users to adjust -d when altering -res. Increasing -d reduces the number of false positives near the diagonal, which arise from the high variability of this intense signal.
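Because -d is expressed in bins, the genomic distance it removes scales with -res. The sketch below shows the arithmetic and a helper for picking a -d that keeps roughly the same distance at a new resolution. Rounding up to a whole bin is an assumption of this sketch, and the function names are illustrative.

```python
def ignored_distance_bp(d: int, res: int) -> int:
    """Genomic distance (bp) removed around the diagonal for a given -d and -res."""
    return d * res


def d_for_distance(distance_bp: int, res: int) -> int:
    """Smallest -d that removes at least `distance_bp` at resolution `res`."""
    return -(-distance_bp // res)  # ceiling division


print(ignored_distance_bp(6, 5000))
print(ignored_distance_bp(6, 10_000))
print(d_for_distance(30_000, 10_000))
```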

-g

The amount of smoothing applied to help decrease artifacts. This helps to remove potential false positives corresponding to isolated pixels of intense signal. The default of 1.5 is for a human cell line with 2.4 billion intra-chromosomal reads binned at 5 kb resolution. However, users may wish to change this setting based on the noise in the data. For example, if many false positives are called due to isolated intense pixels, we suggest increasing -g. If many loops are not called, users can decrease -g. Similarly, at 10 kb or 25 kb resolution, loops may appear more punctate, so a lower -g could be pertinent. We suggest altering -g and -fdr the most during optimization. During processing, SIP outputs the number of loops identified by the image processing before FDR filtering. If not enough loops are identified during the image processing, you may wish to decrease -g.

-cpu

Specifies the number of CPUs used for SIP processing (default 1). We suggest one per chromosome.
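A small sketch of the "one per chromosome" suggestion follows. Capping the value at the number of cores available on the machine is an added assumption of this sketch, and the function name is illustrative.

```python
import os


def suggested_cpu(n_chromosomes, available=None):
    """Suggest a -cpu value: one per chromosome, capped at the available cores."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(n_chromosomes, available))


print(suggested_cpu(23, available=8))
print(suggested_cpu(1, available=8))
```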

-factor

Specifies whether loop calling should be performed at multiple resolutions. A factor of 1 means loops are called only at the specified -res.
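The two cases described on this page can be written as a lookup: -factor 1 calls loops at -res only, and the -res section states that -res 5000 with -factor 4 calls loops at 5, 10, and 25 kb (multipliers 1, 2, and 5). The sketch below encodes only those two documented cases and does not guess at other factor values. The names are illustrative.

```python
# Multipliers of -res implied by this page; other -factor values are not
# described here, so they are intentionally absent.
DOCUMENTED_MULTIPLIERS = {1: (1,), 4: (1, 2, 5)}


def resolutions_for(res: int, factor: int) -> list:
    """Resolutions (bp) at which loops are called for a documented -factor."""
    if factor not in DOCUMENTED_MULTIPLIERS:
        raise ValueError(f"-factor {factor} is not described on this page")
    return [res * m for m in DOCUMENTED_MULTIPLIERS[factor]]


print(resolutions_for(5000, 4))
print(resolutions_for(10_000, 1))
```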

-max

Increases the value of the pixels surrounding high-intensity pixels during image analysis. Increasing the max filter enlarges the bright spots. It is better to keep the same value for the min and max filters.

-min

Removes isolated high-intensity pixels during image analysis. Increasing the min filter removes more small bright spots in enriched regions and can also remove some potential loops. It is better to keep the same value for the min and max filters.
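To illustrate the general behaviour of these two filters, here is a pure-Python sketch of generic maximum and minimum filters over a square window. It is not SIP's image-processing code. The window shape, the radius, and the toy 5 x 5 image are assumptions chosen to show that a max filter enlarges a bright spot while a min filter removes an isolated bright pixel.

```python
def rank_filter(image, radius, pick):
    """Apply `pick` (min or max) over a square window of the given radius."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = pick(window)
    return out


def max_filter(image, radius):
    return rank_filter(image, radius, max)


def min_filter(image, radius):
    return rank_filter(image, radius, min)


def count_bright(image, level=1):
    """Number of pixels at or above `level`."""
    return sum(value >= level for row in image for value in row)


# Toy image: a single isolated bright pixel in the centre.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 9

print(count_bright(image))                 # bright pixels before filtering
print(count_bright(max_filter(image, 1)))  # spot enlarged by the max filter
print(count_bright(min_filter(image, 1)))  # isolated pixel removed by the min filter
```

Applying both filters with the same radius grows and shrinks features by the same amount, which is one way to read the advice to keep the two values equal.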

-sat

Percentage of saturated pixels in each matrix used to compute the contrast. If the signal in your Hi-C data is low and sparse, you can increase this value.

-t

The initial threshold at which loops are detected. Increase it to obtain fewer, more robust loops; decrease it to obtain more loops, many of which may be false positives.

-nbZero

Used to filter false positives close to unmappable regions by examining the number of pixels in the neighborhood that are zero. A lower value could cause false positives to be called near repetitive regions. If you notice false positives near sparse data, try increasing this value.

-norm

The normalization scheme to use. We recommend Knight-Ruiz (KR / balanced) from Juicer. The chosen normalization must be present in the .hic file. Without enough sequencing depth, however, it may be impossible to perform matrix balancing for some chromosomes, so we recommend ensuring that each chromosome has a KR vector before starting. Otherwise, VC_SQRT is a suitable alternative.

-del

Whether or not to delete the image files created during loop detection.

-fdr

The empirical FDR value to use. This controls how many called loops may have an enrichment less than or equal to that of nearby random sites. As mentioned above, during processing SIP outputs the number of loops called during image recognition and the number remaining after FDR filtering. If too many loops are removed during FDR filtering, users can decrease the stringency. The -g and -fdr options are the parameters we recommend optimizing the most. For example, with less smoothing you may want a stricter FDR, and vice versa.

-isDroso

Default is false. Set this option to apply a specific filter that accounts for looping characteristics in D. melanogaster. You can use this parameter if your Hi-C map is similar to a Drosophila one, where loops do not show the same decay as in human cells. Setting this parameter removes the regional enrichment filter.

-isAccurate (NOTE: since v1.3.9 this option has been disabled, as SIP was updated to always ensure consistency)

Default is false. Set this option to sacrifice speed but to ensure the same list of loops gets called each time SIP is run. If left as false, SIP will quickly output a list of loops, but may miss a few here or there. We recommend the default while optimizing loop calling parameters. After you have settled on final parameters, set -isAccurate in order to get a consistent set of final loops.
