
How to do single-particle reconstruction with SIMPLE 1.0


Installation of SIMPLE 1.0

>>> mv Downloads/simple_mac_Dec18_2011.tar.gz /usr/local

>>> gunzip simple_mac_Dec18_2011.tar.gz

>>> tar -xvf simple_mac_Dec18_2011.tar

>>> cd simple

>>> ./

Installation on hopper/local/biox2?

Hopper and Biox2 are clusters that we use for distributed SIMPLE 1.0 execution. If you want a distributed SIMPLE 1.0 version, you will have to modify the module in the lib folder.

>>> local

The files 'source_tcsh' and 'source_bash' contain the two lines you need to add to your .cshrc or .bashrc file, respectively, in order to have prompt access to the SIMPLE 1.0 programs. If you are compiling SIMPLE 1.0 from source, the compiler options used by the compile script are valid for the Intel Fortran compiler, assumed to be executed from the prompt with the command ifort. Currently, no other compilers are supported. Execute the compile script to get instructions.

Program: cluster

  cluster is a program for image clustering based on reference-free in-plane rotational alignment (Penczek et al., 1992) and probabilistic principal component analysis (PCA) for generation of feature vectors (Tipping and Bishop, 1999). Agglomerative hierarchical clustering (AHC) with complete linkage is used for grouping of the feature vectors (Murtagh, 1983), and the clustering solution is refined with center-based iterative k-means clustering. cluster in-plane aligns the input image stack. Bicubic interpolation is used for shifting and rotating the stack before extraction of the pixels within the circular mask defined by mask radius ring2. Next, the probabilistic PCA method generates feature vectors from the vectors of extracted pixels. The 'cluster.log' file describes what was done in each step and lists the output files. The minimum class population (minp) prevents clusters with fewer than minp members from being represented by an output average. Clean up the class averages stack before generating a Fourier stack for input to program origami, described below.

>>> cluster stk=ptclstk.spi box=<box size(in pixels)> nptcls=<nr of images in stack> smpd=<sampling distance(in Å)> [ring1=<first ring(in pixels){5}>] [ring2=<second ring(in pixels){box/2}>] [ncls=<nr clusters{500}>] [minp=<minimum nr ptcls cluster{10}>] [nvars=<nr eigenvectors{30/60}>] [nthr=<nr openMP threads{1}>] [nran=<size of random sample{nptcls}>] [oritab=<SIMPLE 1.0 alignment doc>] [clsdoc=<Spider clustering doc>] [dopca=<yes|no>] [doalign=<yes|no>] [debug=<yes|no>]

The number of clusters is a critical parameter. The setup allows for quick testing of the number of clusters. One pass produces the file 'pdfile.bin' containing the matrix with all pair-wise feature vector distances. Setting the optional parameter dopca to 'no' in a second round of execution from the same directory makes the program read the previously generated distances and re-do the clustering using whatever settings are input for parameters ncls & minp. The optional parameter oritab is used to provide in-plane parameters for the clustering (produced by programs align or cycler, described below). This option is used for generating class averages that are going to be subjected to heterogeneity analysis by program origami, described below. The number of eigenvectors is also a critical parameter. The default setting uses 30 eigenvectors if you are not inputting in-plane parameters via optional parameter oritab, and 60 eigenvectors if you do. Note that the distance matrix is kept in RAM, so for large data sets you need LOTS of internal memory. This can be addressed by using a random sample of the data for the initial AHC clustering, by setting nran to some number < nptcls. In this setting, the AHC centers generated from the random sample are used to extend the clustering to the entire data set with k-means. This overcomes the well-known initialization problem of k-means and enables clustering of many hundreds of thousands of particle images. SIMPLE 1.0 has been used to cluster 300,000 images with a box size of 100 using a random subset of 60,000 images on a machine with 96 GB RAM.
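The subset-seeded clustering strategy can be sketched as follows. This is an illustration in Python (numpy/scipy) of the general technique (complete-linkage AHC on a random sample, followed by k-means extension to the full data set), not SIMPLE 1.0's Fortran implementation; the data and all parameter values are made up.

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: AHC with complete linkage on a
# random subset of feature vectors (cf. nran < nptcls), then k-means
# extension of the subset centers to the entire data set.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 30))   # 200 "particles", 30 eigenvector features
feats[100:] += 4.0                   # two well-separated groups

# 1) complete-linkage AHC on a random subset (cf. nran)
subset = rng.choice(len(feats), size=60, replace=False)
Z = linkage(feats[subset], method='complete')
labels_sub = fcluster(Z, t=2, criterion='maxclust')

# 2) subset cluster centers seed k-means over the entire data set
centers = np.array([feats[subset][labels_sub == k].mean(axis=0)
                    for k in np.unique(labels_sub)])
for _ in range(20):                  # center-based iterative k-means
    d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    centers = np.array([feats[assign == k].mean(axis=0)
                        for k in range(len(centers))])
```

Seeding k-means with the AHC centers is what sidesteps the random-initialization problem mentioned above.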

Program: spi_to_fim

  spi_to_fim is a program for generating and stacking 2D Fourier transforms using a spider stack of real images as input. EMAN and Spider generate such image stacks. Output consists of the files outbdy.fim (stack of transforms) and outbdy.hed (header) that are used by virtually all other SIMPLE 1.0 programs.

>>> spi_to_fim stk=spistackin.spi box=<box size(in pixels)> nptcls=<nr of images in stack> smpd=<sampling distance(in Å)> outbdy=<body of output files> [fromp=<start ptcl>] [top=<stop ptcl>] [msk=<mask radius(in pixels)>] [pad=<yes|no>] [debug=<yes|no>]

Beware that the msk parameter is optional even though the images subjected to Fourier transformation need to be masked with a soft mask to avoid edge-induced artifacts. If the input images to spi_to_fim are not masked beforehand, you must apply a soft mask here using the optional msk parameter. Setting the optional parameter pad to 'yes' pads the images to twice the size before Fourier transformation. Padded Fourier stacks are used for reconstruction. The individual program descriptions below indicate whether the input Fourier stack is expected to be padded, so it may be convenient for you to generate the padded stack now. Setting the optional parameter debug to 'yes' will not produce the full stack, but a spider stack (debug.spi) with one image, with all transformations (masking, shifting & padding) applied to the first image of the input stack.
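The masking and padding steps can be illustrated with a short numpy sketch. This is not SIMPLE 1.0 code; the cosine edge profile and the edge width of 3 pixels are assumptions made for the illustration.

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: soft circular mask (cf. msk)
# followed by zero-padding to twice the box size (cf. pad=yes) before the FFT.
import numpy as np

box, msk = 64, 24
img = np.random.default_rng(1).normal(size=(box, box))

# soft circular mask: cosine edge falling from 1 to 0 over `edge` pixels
y, x = np.indices((box, box)) - box // 2
r = np.hypot(x, y)
edge = 3.0
mask = np.clip((msk + edge - r) / edge, 0.0, 1.0)
mask = 0.5 - 0.5 * np.cos(np.pi * mask)          # cosine roll-off
masked = img * mask

# pad to twice the size, keeping the image centered
padded = np.zeros((2 * box, 2 * box))
padded[box // 2: box // 2 + box, box // 2: box // 2 + box] = masked

ft = np.fft.fftshift(np.fft.fft2(padded))        # the transform to be stacked
```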

Program: fim_to_spi

  Since there is a program spi_to_fim for generating stacks of Fourier transforms there is naturally also a program for back transformation of the Fourier stack: fim_to_spi. Output consists of a spider image stack.

>>> fim_to_spi fstk=fprojs.fim outstk=rprojs.spi [debug=<yes|no>]

Set the optional parameter debug to 'yes' if you want to print the stack header. This may be useful for checking that the stack has been generated using the intended input parameters.

Program: origami

  origami is a program for reference-free 3D alignment and heterogeneity analysis of class averages using optimization of a spectral ensemble common line correlation coefficient (Elmlund et al., 2010; Elmlund et al., 2008). The coefficient is called 'spectral' because it uses adaptive low-pass filtering methods for improved search behavior and noise robustness. The term 'ensemble' was coined to distinguish the method from angular reconstitution, which uses an anchor set of only a few ab initio aligned class averages to orient the remaining ones (van Heel, 1987). Projection direction assignment is done in a discrete space using the constraint that two class averages are not allowed to occupy the same direction. Simulated annealing is used for the combinatorial optimization. The first discrete alignment is restarted three times on random starting configurations, and the best solution is written to the first alignment document. Finally, the additional origin shift and conformational state parameters are included in a greedy adaptive local search-based scheme. This scheme steers a mixed continuous-discrete differential evolution optimizer operating over the six degrees of freedom of one Fourier plane. Input to origami is a Fourier stack of class averages 'cavgstk.fim' generated by cluster. Output from origami consists of reconstructed volumes, with and without solvent flattening applied to reduce the background noise. The automatically generated 'origami.log' lists the output files.

>>> origami fstk=cavgstk.fim lp=<low-pass limit(in Å){15-30}> [froms=<number of states from>] [tos=<number of states to>] [maxits=<nr of rounds{10}>] [msk=<mask radius(in pixels)>] [mw=<molecular weight(in kD){0}>] [frac=<fraction of ptcls to include{0.8}>] [amsklp=<auto mask low-pass limit(in Å){50}>] [edge=<edge size for softening of molecular envelope(in pixels){3}>] [nthr=<nr of openMP threads{1}>] [trs=<origin shift(in pixels){3}>] [hp=<high-pass limit(in Å){100}>] [oritab=<input alignment doc>] [pgrp=<cn|dn>] [debug=<yes|no>]

The outcome of origami depends on the quality of the class averages. The low-pass limit should hence be set to reflect the information content in the averages; a limit of 20 Å works for most purposes. origami can be executed in three modes: (1) ab initio reconstruction mode, (2) early-stage heterogeneity analysis mode, and (3) late-stage heterogeneity analysis mode. In mode (1), class averages have been obtained in cluster using the reference-free 2D alignment algorithm (no model-based alignment document is input), and execution might look like:

>>> origami fstk=cavgstk.fim lp=20 nthr=8 mw=500

In mode (2) you have generated model-based in-plane parameters with cycler or align and input them to cluster to generate class averages. Still, you are somewhat concerned about bias towards the low-resolution initial volume and want to determine projection directions ab initio. Execution might look like:

>>> origami fstk=cavgstk.fim lp=20 nthr=8 mw=500 tos=3

In mode (3) you are confident in the alignment obtained by cycler or align, and you have generated beautiful class averages with cluster using these in-plane parameters as input. Now, use align or cycler to align the class averages back to the best reconstruction(s) so far. Input the class average alignment document to origami and turn off the reference-free orientation refinement:

>>> origami fstk=cavgstk.fim lp=20 nthr=8 mw=500 tos=3 oritab=algndoc.dat doalign=no

origami tests different numbers of states, from froms to tos. The number of states is a critical parameter that needs to be carefully addressed. When the number of states is increased, the number of parameters in the model is increased and the correlation will always improve. This is an example of an overfitting problem. In an attempt to determine whether the increase from s to s+1 states gives rise to a significant correlation gain, the two distributions of individual common line correlations are compared using a Kolmogorov-Smirnov (K-S) statistical test. The K-S statistic and the K-S probability are printed to the 'origami.log' file. If the K-S probability value is small and the K-S statistic is relatively large, the two distributions differ significantly. A binary heterogeneity will give rise to a significant change in correlation going from one to two state groups, whereas the changes from two to three groups and three to four groups will be less significant. The K-S statistic may be informative, but it does not ensure against sparsely populated, poor-quality state groups showing up. These groups are usually easy to identify, and they should be excluded from further analysis.
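The K-S comparison of correlation distributions can be illustrated as follows. This is a Python/scipy sketch of the general statistical test, not SIMPLE 1.0's implementation; the correlation values are synthetic.

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: a two-sample K-S test comparing
# per-line correlation distributions, as used when judging the gain from
# s to s+1 states.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
corr_s  = rng.normal(0.60, 0.05, size=400)   # correlations with s states
corr_s1 = rng.normal(0.70, 0.05, size=400)   # clearly better model
corr_s2 = corr_s1 + rng.normal(0.0, 0.001, size=400)  # marginal change only

stat_big, p_big = ks_2samp(corr_s, corr_s1)      # large stat, tiny p-value
stat_small, p_small = ks_2samp(corr_s1, corr_s2)  # small stat, large p-value
# large statistic + small p-value => the distributions differ significantly
```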

Program: align

  align is a program for continuous reference-based 3D alignment of individual images, given input reference volume(s). The algorithm is based on advanced differential evolution (DE) for continuous global optimization (Elmlund and Elmlund, 2009). Previously, SIMPLE used a Fourier-based interpolation scheme that extracted common lines between reference central sections and the particle section. In this release, the Fourier-based interpolation scheme is reconciled with that used in Frealign. The entire reference section used for matching is now interpolated directly from the 3D Fourier volume, which requires only a little extra computation. The method of spectral self-adaptation is applied to estimate a particle-dependent low-pass frequency limit.

>>> align mode=<mode nr> fstk=<input Fourier stack(*.fim)> vol1=<refvol_1.spi> [vol2=<refvol_2.spi>...etc.] outfile=<output alignment doc> [msk=<spherical mask radius(in pixels)>] [lp=<low-pass limit(in Å){30}>] [trs=<origin shift(in pixels){3}>] [trsstep=<origin shift stepsize{1}>] [nthr=<nr openMP threads{1}>] [oritab=<input alignment doc>] [fromp=<start ptcl index>] [top=<stop ptcl index>] [nspace=<nr of projections in discrete search{500}>] [pgrp=<cn|dn>] [hp=<high-pass limit(in Å)>] [debug=<yes|no>]

Available modes are:

mode=20: multi-reference alignment with a fixed low-pass limit (lp); no input orientations are required
mode=21: multi-reference alignment with spectral self-adaptation; input orientations are required
a further mode for finding the filtering threshold or doing spectral scoring; input orientations are required

The strategy is to begin with mode=20 alignment on the starting volume(s) using a low-resolution low-pass limit (lp typically set to 30 Å). In the first rounds, the origin shift parameters will be far from optimal. The user should make sure that a significant portion of the solutions do not lie on the shift interval borders. If this is the case, the Fourier stack should be shifted (using program shift_fim, described below) and the mode=20 search re-run. This reduces the complexity of the optimization problem and improves the quality of the solutions obtained in later refinement rounds. After completing the shift alignment and the first orientation assignment, volumes are reconstructed with program reconstruct (described below). In subsequent refinement rounds, mode=21-based alignment is combined with reconstruction until quasi-convergence. The shift range should now be limited to around [-3,3]. The mode=21 refinement should be run until convergence, as measured by the Fourier Shell Correlation (FSC) plot calculated between reconstructions from successive rounds.
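The FSC convergence criterion mentioned above can be sketched in a few lines of numpy. This is an illustration of the general definition of the FSC (shell-wise normalized cross-correlation in Fourier space), not SIMPLE 1.0's code.

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: Fourier Shell Correlation between
# two volumes, as used to measure convergence between successive rounds.
import numpy as np

def fsc(vol1, vol2):
    f1, f2 = np.fft.fftn(vol1), np.fft.fftn(vol2)
    n = vol1.shape[0]
    k = np.fft.fftfreq(n) * n                     # integer frequency indices
    kx, ky, kz = np.meshgrid(k, k, k, indexing='ij')
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int)
    curve = []
    for s in range(1, n // 2):                    # one value per shell
        m = shell == s
        num = np.sum(f1[m] * np.conj(f2[m]))
        den = np.sqrt(np.sum(np.abs(f1[m])**2) * np.sum(np.abs(f2[m])**2))
        curve.append(float(np.real(num) / den))
    return np.array(curve)

rng = np.random.default_rng(3)
vol = rng.normal(size=(32, 32, 32))
curve_same = fsc(vol, vol)                        # identical volumes: FSC = 1
curve_noise = fsc(vol, rng.normal(size=(32, 32, 32)))  # uncorrelated: FSC ~ 0
```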

Program: reconstruct

  reconstruct is a program for reconstructing volumes from a padded SIMPLE 1.0 Fourier transform stack (*.fim) when given input orientations and state assignments (obtained by program align). The algorithm is based on Fourier gridding with a Gaussian window function. This window function reduces the real-space ripple artifacts associated with direct moving window sinc interpolation. The feature sought when implementing this algorithm was to enable quick, reliable reconstruction from aligned individual particle images. A bonus is that the gridding method is very easy to parallelize.

>>> reconstruct fstkpd=fprojs.fim [oritab=algndoc.dat] [msk=<mask radius(in pixels)>] [mw=<molecular weight(in kD){0}>] [frac=<fraction of ptcls to include{0.8}>] [lp=<auto mask low-pass limit(in Å){40}>] [nthr=<nr of openMP threads{1}>] [eo=<yes|no>] [pgrp=<cn|dn>] [debug=<yes|no>] [part=<partition number>] [fromp=<start ptcl>] [top=<stop ptcl>] [state=<state to reconstruct>]

Random orientations are used to generate the reconstruction if no orientations are input. The optional parameter eo is used for generating even-odd reconstructions subjected to FSC analysis. The optional parameter mw is used to do solvent flattening of the output volumes to reduce the background noise. If you do not know the molecular weight of your complex, or you study materials that are not protein (metallic nanoparticles, for example), then input 0 to indicate that the molecular weight is unknown. The default mw value is 0.
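The principle of gridding with a Gaussian window can be sketched as follows. This illustrative numpy code (not SIMPLE 1.0's parallel Fortran routine) spreads a single Fourier sample at a continuous position onto nearby 3D grid points and normalizes by the accumulated weights; the window half-width and sigma are made-up values.

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: Gaussian-window gridding of one
# Fourier sample into a 3D grid, with a parallel weight volume accumulated
# for normalization.
import numpy as np

n, win, sig = 32, 2, 1.0
grid = np.zeros((n, n, n), dtype=complex)
wts = np.zeros((n, n, n))

def insert_sample(pos, val):
    """Spread one complex sample at continuous position pos into the grid."""
    base = np.floor(pos).astype(int)
    for dx in range(-win, win + 1):
        for dy in range(-win, win + 1):
            for dz in range(-win, win + 1):
                p = base + np.array([dx, dy, dz])
                if np.any(p < 0) or np.any(p >= n):
                    continue
                w = np.exp(-np.sum((p - pos)**2) / (2 * sig**2))
                grid[tuple(p)] += w * val
                wts[tuple(p)] += w

insert_sample(np.array([10.3, 15.7, 8.1]), 1.0 + 0.5j)
# divide out the accumulated weights where any were deposited
recon = np.where(wts > 0, grid / np.where(wts > 0, wts, 1), 0)
```

In a full reconstruction, every sample of every rotated 2D transform would be inserted this way before normalization, which is why the method parallelizes so easily.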

Program: automask

  automask is a program for doing solvent flattening of an input spider volume. The algorithm for background removal is based on low-pass filtering and binarization. First, the volume is low-pass filtered to lp. A binary volume is then generated by assigning foreground pixels (=1) based on the volume calculated from the molecular weight, assuming a protein density of 1.43 g/mL. The edge of the resulting binary volume is softened by a real-space low-pass filter before multiplying it with the 'raw' input volume to generate the flattened map.

>>> automask vol1=invol1.spi [vol2=invol2.spi...etc.] box=<box size(in pixels)> smpd=<sampling distance(in Å)> mw=<molecular weight(in kD)> [lp=<low-pass limit(in Å){40}>] [msk=<mask radius(in pixels)>] [edge=<edge size for softening of the molecular envelope(in pixels){3}>] [debug=<yes|no>]

No comments.
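For the curious, the automask steps described above (low-pass filtering, binarization to the voxel count implied by the molecular weight, edge softening) can be sketched in numpy/scipy. The code below is an illustration of the principle, not SIMPLE 1.0's implementation; a Gaussian filter stands in for the low-pass filters, and the box, sampling, and weight values are made up.

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: solvent flattening by low-pass
# filtering, binarization from the molecular weight, and edge softening.
import numpy as np
from scipy.ndimage import gaussian_filter

box, smpd = 48, 2.0            # box size (pixels) and sampling distance (Å)
mw_kd = 100.0                  # molecular weight in kD (illustrative value)

rng = np.random.default_rng(4)
vol = rng.normal(size=(box, box, box))
z, y, x = np.indices(vol.shape) - box // 2
vol += 5.0 * (np.hypot(np.hypot(x, y), z) < 10)   # a blob of "protein"

# 1) low-pass filter (Gaussian as a stand-in for the Fourier filter)
lp = gaussian_filter(vol, sigma=2.0)

# 2) voxels the molecule should occupy, assuming 1.43 g/mL protein density:
#    volume (Å^3) = mw(g/mol) / (density * N_A), with 1 mL = 1e24 Å^3
mw_g = mw_kd * 1000.0
vol_A3 = mw_g / (1.43 * 6.022e23) * 1e24
nvox = int(vol_A3 / smpd**3)

# binarize: keep the nvox brightest low-passed voxels as foreground (=1)
thresh = np.sort(lp.ravel())[-nvox]
binary = (lp >= thresh).astype(float)

# 3) soften the envelope edge, then flatten the solvent background
soft = gaussian_filter(binary, sigma=1.5)
flattened = vol * soft
```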

Program: cycler

  cycler combines align, reconstruct, and automask into a single program for refinement that is suitable for shared-memory multi-processor architectures, since the parallelization is based on the OpenMP protocol. Not much else to say - the individual components are described above.

>>> cycler fstk=<input Fourier stack(*.fim)> fstkpd=<input padded Fourier stack(*.fim)> vol1=<refvol_1.spi> [vol2=<refvol_2.spi>...etc.] [maxits=<nr of rounds{50}>] [msk=<spherical mask radius(in pixels)>] [mw=<molecular weight(in kD){0}>] [startit=<start iteration nr{1}>] [frac=<fraction of ptcls to include{0.8}>] [lp=<low-pass limit(in Å){30}>] [amsklp=<automask low-pass limit(in Å){40}>] [edge=<edge size for softening of the molecular envelope(in pixels){3}>] [trs=<origin shift(in pixels){3}>] [trsstep=<origin shift stepsize{1}>] [nthr=<nr openMP threads{1}>] [oritab=<input alignment doc>] [pgrp=<cn|dn>] [specoff=<yes|no>] [hp=<high-pass limit(in Å)>] [mskfile=<mask.spi>] [debug=<yes|no>]

cycler can be run on class averages or on smaller data sets. If class averages are used, turn off the spectral self-adaptation (by setting specoff=yes) and set a fixed low-pass limit. For refinement of large data sets you need to do distributed execution in a cluster environment.

Program: shift_fim

  shift_fim is a program for shifting a SIMPLE 1.0 Fourier stack according to the origin shifts printed in the alignment document produced by programs cluster, align or cycler. A linear phase shift is applied in Fourier space, so there are no interpolation errors associated with this operation. Happy shifting!

>>> shift_fim fstk=fstack.fim outfstk=shifted_fstack.fim oritab=algndoc.dat [debug=<yes|no>]

No comments.
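For the curious, the exactness of the linear phase shift is easy to demonstrate in numpy (an illustration of the Fourier shift theorem, not SIMPLE 1.0's code):

```python
# Illustrative sketch, NOT SIMPLE 1.0 code: multiplying the transform by
# exp(-2*pi*i*(k·s)/n) translates the image by s pixels (circularly, for
# integer s) with no interpolation error.
import numpy as np

n, shift = 16, (3, -2)
img = np.zeros((n, n))
img[5, 7] = 1.0                                   # a single bright pixel

ft = np.fft.fft2(img)
ky, kx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing='ij')
phase = np.exp(-2j * np.pi * (ky * shift[0] + kx * shift[1]))
shifted = np.real(np.fft.ifft2(ft * phase))       # pixel now at (8, 5)
```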