hypercluster SnakeMake pipeline
Line-by-line explanation of config.yml
config.yml parameter | Explanation | Example from scRNA-seq workflow |
---|---|---|
input_data_folder | Path to the folder in which input data can be found. No / at the end. | '.' |
input_data_files | List of data file prefixes. Exclude the extension; .csv, .tsv and .txt files are allowed. | ['sc_data'] |
gold_standards | File name of the gold standard file. It must use the same pandas.read_csv kwargs as the corresponding input file and must be in input_data_folder. | {'test_input': 'gold_standard.csv'} |
read_csv_kwargs | Per input data file, keyword args to pass to pandas.read_csv. If specifying a multiindex, also put the same in output_kwargs['labels']. | {'test_input': {'index_col':[0]}} |
output_folder | Path to the folder in which results will be written. No / at the end. | 'results' |
intermediates_folder | Name of the folder within output_folder for intermediate results, such as labels and evaluations per condition. Usually there is no need to change this. | 'clustering_intermediates' |
clustering_results | Name of the folder within output_folder for final results. Usually there is no need to change this. | 'clustering' |
clusterer_kwargs | Additional static keyword arguments to pass to individual clusterers. These are not optimized. | {} |
generate_parameters_addtl_kwargs | Additional keyword arguments for the hypercluster.AutoClusterer class. | {} |
evaluations | Names of evaluation metrics to use. See hypercluster.constants.inherent_metrics or hypercluster.constants.need_ground_truth. | ['silhouette_score', 'calinski_harabasz_score', 'davies_bouldin_score', 'number_clustered', 'smallest_largest_clusters_ratio', 'smallest_cluster_ratio'] |
eval_kwargs | Additional kwargs per evaluation metric function. | {} |
metric_to_choose_best | If picking best labels, which metric to maximize to choose the labels. If not choosing best labels, leave as an empty string (''). | silhouette_score |
metric_to_compare_labels | If comparing the pairwise similarity of labeling results, which metric to use. To skip this comparison, leave blank or as an empty string. | adjusted_rand_score |
compare_samples | Whether to make a table and figure with counts of how often two samples are in the same cluster. | true |
output_kwargs | pandas.DataFrame.to_csv and pandas.read_csv kwargs per output type. The evaluations kwargs generally do not need to change, but index_col for labels has to match the index_col given in read_csv_kwargs. | {'evaluations': {'index_col': [0]}, 'labels': {'index_col': [0]}} |
heatmap_kwargs | Additional kwargs for seaborn.heatmap for visualizations. | {} |
optimization_parameters | Fun part! This is where you put which hyperparameters to try per algorithm. | {'HDBSCAN': {'min_cluster_size': [2, 3, 4, 5]}, 'KMeans': {'n_clusters': [5, 6, 7]}, 'MiniBatchKMeans': {'n_clusters': [5, 6, 7]}, 'OPTICS': {'min_samples': [2, 3, 4, 5]}} |
**Note: Lists and dictionaries can be written in Python syntax (like above), in YAML syntax, or in a mixture of the two (like below).**
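Several of the explanations above are constraints that can be checked before the pipeline is launched. The following is a minimal sketch of such a check, not part of hypercluster: `validate_config` is a hypothetical helper, it assumes the config has already been loaded into a plain Python dictionary, and the rule that `metric_to_choose_best` should appear in `evaluations` is an assumption rather than something the table states.

```python
def validate_config(config):
    """Return a list of problems found in a config dict (hypothetical helper)."""
    problems = []

    # Folder paths must not end with a slash.
    for key in ("input_data_folder", "output_folder"):
        if str(config.get(key, "")).endswith("/"):
            problems.append(f"{key} must not end with '/'")

    # Input file prefixes should be given without their extension.
    for prefix in config.get("input_data_files", []):
        if prefix.endswith((".csv", ".tsv", ".txt")):
            problems.append(f"input_data_files entry '{prefix}' should omit its extension")

    # Assumption: the metric used to pick the best labels must be computed.
    best = config.get("metric_to_choose_best", "")
    if best and best not in config.get("evaluations", []):
        problems.append(f"metric_to_choose_best '{best}' is not listed in evaluations")

    # index_col for labels has to match index_col in read_csv_kwargs.
    labels_index = config.get("output_kwargs", {}).get("labels", {}).get("index_col")
    for name, kwargs in config.get("read_csv_kwargs", {}).items():
        if "index_col" in kwargs and kwargs["index_col"] != labels_index:
            problems.append(f"labels index_col does not match read_csv_kwargs['{name}']")

    return problems


config = {
    "input_data_folder": ".",
    "input_data_files": ["sc_data"],
    "read_csv_kwargs": {"test_input": {"index_col": [0]}},
    "output_folder": "results",
    "evaluations": ["silhouette_score", "calinski_harabasz_score"],
    "metric_to_choose_best": "silhouette_score",
    "output_kwargs": {
        "evaluations": {"index_col": [0]},
        "labels": {"index_col": [0]},
    },
}

print(validate_config(config))  # []
```

An empty list means none of the checked constraints were violated. A trailing slash on `output_folder`, for example, would add one entry to the returned list.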
config.yml example from scRNA-seq workflow
input_data_folder: '.'
input_data_files:
  - sc_data
gold_standards:
  test_input: 'gold_standard.csv'
read_csv_kwargs:
  test_input: {'index_col':[0]}
output_folder: 'results'
intermediates_folder: 'clustering_intermediates'
clustering_results: 'clustering'
clusterer_kwargs: {}
generate_parameters_addtl_kwargs: {}
evaluations:
  - silhouette_score
  - calinski_harabasz_score
  - davies_bouldin_score
  - number_clustered
  - smallest_largest_clusters_ratio
  - smallest_cluster_ratio
eval_kwargs: {}
metric_to_choose_best: silhouette_score
metric_to_compare_labels: adjusted_rand_score
compare_samples: true
output_kwargs:
  evaluations:
    index_col: [0]
  labels:
    index_col: [0]
heatmap_kwargs: {}
optimization_parameters:
  HDBSCAN:
    min_cluster_size: &id002
      - 2
      - 3
      - 4
      - 5
  KMeans:
    n_clusters: &id001
      - 5
      - 6
      - 7
  MiniBatchKMeans:
    n_clusters: *id001
  OPTICS:
    min_samples: *id002
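In the example, `&id001` and `&id002` are YAML anchors and `*id001` and `*id002` are aliases that reuse the anchored lists, so `MiniBatchKMeans` gets the same `n_clusters` values as `KMeans`, and `OPTICS` gets the same `min_samples` values as the `min_cluster_size` of `HDBSCAN`. The sketch below mirrors that structure in plain Python and lists the conditions such a grid describes. It is an illustration only: `expand_conditions` is a hypothetical helper, and treating every combination of listed values as one condition is an assumption, not a description of hypercluster's internals.

```python
from itertools import product

# Shared lists play the role of the YAML anchors (&id001, &id002).
n_clusters = [5, 6, 7]
min_size = [2, 3, 4, 5]

optimization_parameters = {
    "HDBSCAN": {"min_cluster_size": min_size},
    "KMeans": {"n_clusters": n_clusters},
    "MiniBatchKMeans": {"n_clusters": n_clusters},
    "OPTICS": {"min_samples": min_size},
}


def expand_conditions(params_by_algorithm):
    """Yield (algorithm, kwargs) for every combination of listed values."""
    for algorithm, grid in params_by_algorithm.items():
        names = list(grid)
        # Cartesian product over the value lists of this algorithm.
        for values in product(*(grid[name] for name in names)):
            yield algorithm, dict(zip(names, values))


conditions = list(expand_conditions(optimization_parameters))
print(len(conditions))  # 14
print(conditions[0])    # ('HDBSCAN', {'min_cluster_size': 2})
```

With one hyperparameter per algorithm, the grid has 4 + 3 + 3 + 4 = 14 conditions. Adding a second hyperparameter to an algorithm multiplies that algorithm's share by the number of values listed for it.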