hypercluster Snakemake pipeline

Line-by-line explanation of config.yml

| config.yml parameter | Explanation | Example from scRNA-seq workflow |
| --- | --- | --- |
| `input_data_folder` | Path to the folder in which the input data can be found. No `/` at the end. | `/input_data` |
| `input_data_files` | List of prefixes of the data files, excluding the extension; `.csv`, `.tsv`, and `.txt` files are allowed. | `['input_data1', 'input_data2']` |
| `gold_standards` | Mapping from input file prefix to the file name of its gold standard file. Each gold standard must be readable with the same `pandas.read_csv` kwargs as the corresponding input file, and must be in `input_data_folder`. | `{'input_data': 'gold_standard_file.txt'}` |
| `read_csv_kwargs` | Keyword arguments to pass to `pandas.read_csv`, per input data file. If specifying a multiindex, put the same in `output_kwargs['labels']`. | `{'test_input': {'index_col': [0]}}` |
| `output_folder` | Path to the folder in which results will be written. No `/` at the end. | `/hypercluster_results` |
| `intermediates_folder` | Name of the folder within `output_folder` for intermediate results, such as labels and evaluations per condition. Usually does not need to be changed. | `clustering_intermediates` |
| `clustering_results` | Name of the folder within `output_folder` for final results. Usually does not need to be changed. | `clustering` |
| `clusterer_kwargs` | Additional static keyword arguments to pass to individual clusterers; these are not optimized. | `{'KMeans': {'random_state': 8}}` |
| `generate_parameters_addtl_kwargs` | Additional keyword arguments for the `hypercluster.AutoClusterer` class, e.g. for weighted random search (see the sampling sketch after the example config below). | `{'KMeans': {'random_search': true, 'param_weights': {'n_clusters': {5: 0.25, 6: 0.75}}}}` |
| `evaluations` | Names of evaluation metrics to use. See `hypercluster.constants.inherent_metrics` or `hypercluster.constants.need_ground_truth`. | `['silhouette_score', 'number_clustered']` |
| `eval_kwargs` | Additional kwargs per evaluation metric function. | `{'silhouette_score': {'random_state': 8}}` |
| `metric_to_choose_best` | If picking the best labels, the metric to maximize when choosing them. If not choosing best labels, leave as an empty string (`''`). | `silhouette_score` |
| `metric_to_compare_labels` | If comparing the pairwise similarity of labeling results, the metric to use. To skip this comparison, leave as an empty string. | `adjusted_rand_score` |
| `compare_samples` | Whether to make a table and figure with counts of how often two samples land in the same cluster. | `true` |
| `output_kwargs` | `pandas.to_csv` and `pandas.read_csv` kwargs per output type. The evaluations kwargs generally don't need to change, but the labels `index_col` has to match the `index_col` in `read_csv_kwargs` (see the pandas sketch after this table). | `{'evaluations': {'index_col': [0]}, 'labels': {'index_col': [0]}}` |
| `heatmap_kwargs` | Additional kwargs for `seaborn.heatmap` in visualizations. | `{'vmin': -2, 'vmax': 2}` |
| `optimization_parameters` | Fun part! This is where you list which hyperparameters to try per algorithm. | `{'KMeans': {'n_clusters': [5, 6, 7]}}` |
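
The `index_col` coupling between `read_csv_kwargs` and `output_kwargs['labels']` exists because label tables are written out and read back in during the pipeline. Here is a minimal pandas round trip showing why the two have to agree; the file and column names are made up for illustration and are not part of the workflow:

```python
import pandas as pd

# Stand-in for an input table; real inputs live in input_data_folder.
data = pd.DataFrame({"gene1": [0.1, 0.2, 0.3]}, index=["cellA", "cellB", "cellC"])
data.index.name = "sample"
data.to_csv("test_input.csv")

# read_csv_kwargs: {'test_input': {'index_col': [0]}} corresponds to:
data_back = pd.read_csv("test_input.csv", index_col=[0])

# Label tables share the input's index; the column name here is invented.
labels = pd.DataFrame({"KMeans_n_clusters_5": [0, 1, 0]}, index=data_back.index)
labels.to_csv("labels.csv")

# output_kwargs: {'labels': {'index_col': [0]}} has to match, otherwise the
# sample index comes back as an ordinary column and rows no longer align.
labels_back = pd.read_csv("labels.csv", index_col=[0])
assert (labels_back.index == data_back.index).all()
```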

**Note:** Lists and dictionaries can be formatted in Python syntax (like above), in YAML syntax, or in a mixture of the two, as in the example config below.
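
Both spellings load to identical values, because the Python-style dictionaries and lists are valid YAML flow style. A quick standalone check with PyYAML (not part of the pipeline) makes this concrete:

```python
import yaml  # PyYAML

python_style = "clusterer_kwargs: {'KMeans': {'random_state': 8}}"
yaml_style = """
clusterer_kwargs:
  KMeans:
    random_state: 8
"""

# Flow style (Python-like) and block style parse to the same nested dict.
assert yaml.safe_load(python_style) == yaml.safe_load(yaml_style)
```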

config.yml example from scRNA-seq workflow

```yaml
input_data_folder: '.'
input_data_files:
  - sc_data
gold_standards:
  sc_data: 'gold_standard.csv'
read_csv_kwargs:
  sc_data: {'index_col':[0]}

output_folder: 'results'
intermediates_folder: 'clustering_intermediates'
clustering_results: 'clustering'

clusterer_kwargs: {}
generate_parameters_addtl_kwargs: {}

evaluations:
  - silhouette_score
  - calinski_harabasz_score
  - davies_bouldin_score
  - number_clustered
  - smallest_largest_clusters_ratio
  - smallest_cluster_ratio
eval_kwargs: {}

metric_to_choose_best: silhouette_score
metric_to_compare_labels: adjusted_rand_score
compare_samples: true

output_kwargs:
  evaluations:
    index_col: [0]
  labels:
    index_col: [0]
heatmap_kwargs: {}

optimization_parameters:
  HDBSCAN:
    min_cluster_size: &id002
    - 2
    - 3
    - 4
    - 5
  KMeans:
    n_clusters: &id001
    - 5
    - 6
    - 7
  MiniBatchKMeans:
    n_clusters: *id001
  OPTICS:
    min_samples: *id002
```
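
To see what `optimization_parameters` and `param_weights` amount to conceptually: each algorithm's value lists are expanded into a grid of clustering conditions, and with `random_search` a weighted subset of values can be drawn instead of the full grid. The sketch below uses only the standard library and illustrates the idea; it is not hypercluster's internal implementation:

```python
import itertools
import random

# Values from the optimization_parameters section above.
optimization_parameters = {
    "KMeans": {"n_clusters": [5, 6, 7]},
    "OPTICS": {"min_samples": [2, 3, 4, 5]},
}

# Full grid: every combination of listed values per algorithm is one
# clustering condition to run and evaluate.
for algorithm, grid in optimization_parameters.items():
    keys, value_lists = zip(*grid.items())
    for combo in itertools.product(*value_lists):
        print(algorithm, dict(zip(keys, combo)))

# param_weights like {'n_clusters': {5: 0.25, 6: 0.75}} bias which values a
# random search draws, favoring 6 clusters three-to-one over 5 here.
weights = {5: 0.25, 6: 0.75}
draws = random.choices(list(weights), weights=list(weights.values()), k=10)
print(draws)
```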