# Bench philosophy: X-ray data collection in protein crystallography

#### Solving Protein Structures

by Juan David Guzman and Dimitrios Evangelopoulos (ISMB, London), *Labtimes* 02/2011

Diffraction concepts are a headache for novice crystallographers, and the underlying mathematics can sometimes be tricky. After a reasonable amount of time spent studying the phenomena, however, it all suddenly becomes clearer.

In the Bench philosophy article in issue 6 of Lab Times 2010 (page 74), we described the purification and crystallisation techniques used in protein crystallography. Now, we will continue our journey with the downstream steps needed to solve a protein structure from a diffracting protein crystal. But first, let’s refresh some theoretical aspects of the X-ray diffraction of crystals.

Diffraction is an optical phenomenon that occurs when waves encounter an obstacle whose size is similar to the incident wavelength. X-ray radiation consists of electromagnetic waves with wavelengths between 0.1 Å and 100 Å (1 Å = 10⁻¹⁰ m), which are emitted when excited electrons return to lower-energy atomic states. In order to resolve atomic features, it is necessary to use radiation with a wavelength of the same order of magnitude as the atomic objects. In typical in-house rotating anode X-ray generators, the wavelength of the X-ray beam is about 1.54 Å, corresponding to the CuKα transition of copper. In proteins, the length of the covalent chemical bonds between carbon, nitrogen and oxygen varies from 1.24 Å (C=O) to 1.53 Å (C-Cα); these dimensions are very close to the X-ray wavelength and, therefore, enable X-ray radiation to “see” the three-dimensional arrangement of protein atoms.

Experimental hutch of Swiss Light Source (SLS) synchrotron. (Photo: Jörg M. Harms)

In synchrotrons, electromagnetic radiation is released when electrons travelling at close to the speed of light are bent off a straight path, and this type of X-ray beam has a much higher intensity than conventional lab sources. Thanks to this intensity, much smaller crystals and crystals with very large unit cell dimensions may be used. Synchrotrons also allow data collection at almost any wavelength, as opposed to the fixed value of a home source. However, due to the high intensity of synchrotron radiation, the crystals are more prone to radiation damage, even at cryogenic temperatures.

A perfect crystal lattice is an ordered array of unit cells that repeat continuously over three-dimensional space by translation in all directions. The unit cell is the smallest repeated element that generates the crystal, and is defined by three distances (a, b, c) and three angles (α, β, γ). In three dimensions, there are seven lattice systems: cubic, hexagonal, tetragonal, rhombohedral, orthorhombic, monoclinic and triclinic, which are subdivided into 230 space groups. These space groups describe all the possible combinations of symmetry operations by which the asymmetric unit can pack into the unit cell.
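As a small illustration of how the six cell parameters define the cell, the sketch below computes the unit cell volume from the three distances and three angles using the standard formula for a general (triclinic) cell. The function name and the example cell dimensions are made up for illustration only.

```python
import math

def unit_cell_volume(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """Volume (in cubic Å) of a general unit cell; lengths in Å, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x))
                  for x in (alpha_deg, beta_deg, gamma_deg))
    # General triclinic formula; it reduces to a*b*c when all angles are 90°.
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

# Hypothetical orthorhombic cell (all angles 90°)
orthorhombic = unit_cell_volume(50.0, 60.0, 70.0, 90.0, 90.0, 90.0)

# Hypothetical hexagonal cell (gamma = 120°)
hexagonal = unit_cell_volume(10.0, 10.0, 10.0, 90.0, 90.0, 120.0)
```

For the orthorhombic example the result is simply the product of the three edges, whereas the 120° angle of the hexagonal example shrinks the volume by a factor of sin 120°.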

The asymmetric unit is the fundamental unit of crystal construction, so called because it is the smallest unit that can be rotated and translated in order to generate the unit cell. The Miller indices are a set of three numbers (*h,k,l*) used to define a family of parallel lattice planes by specifying how the planes cut the unit cell axes. They are also used to label the spots that arise due to diffraction from these planes.

Electromagnetic waves are periodic (sinusoidal) functions consisting of two orthogonal components, electric *E* and magnetic *H*, which oscillate at right angles (90° or π/2) to one another and to the direction of travel. In X-ray crystallography, only the electric component interacts with the electrons around the atoms, producing diffraction phenomena. Waves are characterised by amplitude, wavelength (or period, in time units) and phase; the wavelength does not change when diffraction occurs. The resulting amplitude is greatest when the waves interfere constructively, that is, when the summed waves have a phase difference of exactly 0 or 2nπ. In most directions the waves scattered by a crystal cancel each other out, so spots on the detector are observed only where constructive interference happens.
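Constructive and destructive interference can be checked numerically by adding waves as complex numbers. The sketch below is only an illustration: it assumes waves of equal wavelength, represents each wave by its amplitude and phase, and uses a hypothetical helper name.

```python
import cmath
import math

def resultant_amplitude(amplitudes, phases):
    """Amplitude of the sum of waves of equal wavelength (phases in radians)."""
    total = sum(a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases))
    return abs(total)

# Two unit waves with a phase difference of 2*pi: constructive interference
in_phase = resultant_amplitude([1.0, 1.0], [0.0, 2 * math.pi])

# Two unit waves with a phase difference of pi: destructive interference
out_of_phase = resultant_amplitude([1.0, 1.0], [0.0, math.pi])
```

The first pair adds up to twice the single-wave amplitude, while the second pair cancels to (numerically almost) zero.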

This idea of constructive interference is the underlying concept behind the Bragg equation. It states that diffraction from a family of parallel lattice planes happens only when the extra path travelled by the wave reflected from one plane, relative to the wave reflected from the neighbouring plane, is an exact multiple of the wavelength: $n\lambda = 2d_{hkl}\sin\theta$, where $n$ is an integer, $\lambda$ is the wavelength, $d_{hkl}$ is the distance between the planes of the family *hkl* and $\theta$ is the angle between the incident beam and the planes.
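The Bragg equation is easy to try out with numbers. The following minimal sketch converts between plane spacing and diffraction angle; the function names are hypothetical, and the 2 Å plane spacing is an arbitrary example value.

```python
import math

def d_spacing(wavelength, angle_deg, n=1):
    """Plane spacing d (Å) from the Bragg equation, n*lambda = 2*d*sin(angle)."""
    return n * wavelength / (2.0 * math.sin(math.radians(angle_deg)))

def bragg_angle(wavelength, d, n=1):
    """Angle (degrees) at which planes with spacing d (Å) diffract."""
    s = n * wavelength / (2.0 * d)
    if s > 1.0:
        # sin(angle) cannot exceed 1, so these planes cannot diffract
        raise ValueError("no reflection possible: n*lambda exceeds 2*d")
    return math.degrees(math.asin(s))

CU_K_ALPHA = 1.54  # Å, the home-source wavelength mentioned earlier

# Planes 2 Å apart diffract at roughly 22.6° with this wavelength
angle = bragg_angle(CU_K_ALPHA, 2.0)
```

The `ValueError` branch reflects a practical limit: with a given wavelength, spacings smaller than half the wavelength cannot be observed at all.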

Crystallographers often distinguish real space from reciprocal space. Real space is the three-dimensional space of the crystal itself, and reciprocal space is the space containing the diffraction spots. As the detector is usually flat, it is necessary to rotate the crystal in real space while acquiring the data; this procedure enables the recording of all three dimensions of reciprocal space. Real and reciprocal space are related to each other by a Fourier transform (FT), which means that you can swap from real space to reciprocal space by applying an FT and vice-versa. In real space, the atoms are positioned in repeating planes separated by a distance d, which corresponds in reciprocal space to a distance 1/d along the direction perpendicular to those planes. In summary, large distances in real space appear as small distances in reciprocal space, and planes in real space give rise to perpendicular vectors in reciprocal space.
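The inverse relationship between real and reciprocal space can be seen in a few lines of code. The sketch below builds the reciprocal basis vectors of a cell from its real-space edge vectors (using the crystallographic convention without a factor of 2π) and derives the plane spacing for given Miller indices. The helper names and the orthorhombic example cell are assumptions made for illustration.

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def reciprocal_basis(a, b, c):
    """Return (a*, b*, c*), e.g. a* = (b x c) / V with V the cell volume."""
    volume = dot(a, cross(b, c))
    return tuple(tuple(x / volume for x in cross(u, v))
                 for u, v in ((b, c), (c, a), (a, b)))

def d_from_hkl(hkl, recip):
    """Plane spacing d = 1 / |h a* + k b* + l c*|."""
    vec = [sum(idx * axis[i] for idx, axis in zip(hkl, recip)) for i in range(3)]
    return 1.0 / math.sqrt(dot(vec, vec))

# Hypothetical orthorhombic cell with edges of 50, 60 and 70 Å
a_vec, b_vec, c_vec = (50.0, 0.0, 0.0), (0.0, 60.0, 0.0), (0.0, 0.0, 70.0)
recip = reciprocal_basis(a_vec, b_vec, c_vec)

d_100 = d_from_hkl((1, 0, 0), recip)  # spacing of the (100) planes
d_200 = d_from_hkl((2, 0, 0), recip)  # spacing of the (200) planes
```

The long 50 Å edge produces a short reciprocal vector a* of length 1/50 Å⁻¹, and the (200) planes, which are twice as far from the origin in reciprocal space, are half as far apart in real space.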

Once the theoretical basis has been understood, we can proceed with the steps necessary to analyse diffraction images. Basically, the position of the spots in reciprocal space depends on the unit cell, and the intensity of the spots depends on the arrangement of the atoms. Therefore, it is possible to obtain the unit cell dimensions and angles (a, b, c, α, β, γ) from the position of the spots. Thereafter, it is necessary to integrate the spot intensities from all the images obtained by rotating the crystal and to average them in order to get a probable space group. At this point, you must bear in mind that the space group is hypothetical until you get the structure factor amplitudes |F*(hkl)*| for each unique set of *hkl* planes.

The initial calculations in X-ray crystallography are performed in reciprocal space. The electron density at a point x, y, z can be calculated from the Fourier transform over all measured hkl of the structure factors:

$$\rho(x,y,z) = \frac{1}{V} \sum_{hkl} |F(hkl)| \exp\left[-2\pi i(hx + ky + lz) + i\alpha(hkl)\right].$$

The structure factor amplitudes |F*(hkl)*| can be measured from the diffraction pattern; however, the other half of the information, the phase (α), is missing as it cannot be measured directly. No detector can record the phase, and obtaining a value for it is far from trivial, which is why crystallographers call it ‘the phase problem’. The phase problem can be solved using experimental or computational approaches, depending on the protein crystal under investigation.
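A toy calculation shows why the missing phases matter so much. The sketch below is a one-dimensional analogue of the electron density equation, with a made-up "unit cell" containing two Gaussian "atoms" on a grid of 64 points. It computes structure factors from the known density, then rebuilds the density once with the correct phases and once with all phases set to zero. Everything here (grid size, atom positions, function names) is an assumption for illustration, not a real crystallographic workflow.

```python
import cmath
import math

N = 64  # grid points along a hypothetical one-dimensional unit cell

def gaussian_atom(x, centre, width=0.03):
    """Gaussian 'atom' in fractional coordinates, wrapped around the cell."""
    d = min(abs(x - centre), 1.0 - abs(x - centre))
    return math.exp(-(d / width) ** 2)

grid = [j / N for j in range(N)]
density = [gaussian_atom(x, 0.25) + 0.5 * gaussian_atom(x, 0.60) for x in grid]

def structure_factors(rho):
    """F(h) = sum over x of rho(x) * exp(+2*pi*i*h*x), for h = 0 .. N-1."""
    n = len(rho)
    return [sum(r * cmath.exp(2j * math.pi * h * j / n) for j, r in enumerate(rho))
            for h in range(n)]

def synthesis(amplitudes, phases):
    """rho(x) = (1/N) * sum over h of |F(h)| * exp(-2*pi*i*h*x + i*alpha(h))."""
    n = len(amplitudes)
    return [sum(a * cmath.exp(-2j * math.pi * h * j / n + 1j * p)
                for h, (a, p) in enumerate(zip(amplitudes, phases))).real / n
            for j in range(n)]

F = structure_factors(density)
amplitudes = [abs(f) for f in F]        # what the detector can measure
phases = [cmath.phase(f) for f in F]    # what is lost in the experiment

with_phases = synthesis(amplitudes, phases)
without_phases = synthesis(amplitudes, [0.0] * N)
```

With the correct phases, the original density is recovered, with its main peak at x = 0.25. With all phases set to zero, the same amplitudes pile up into a peak at the origin, where the true density is essentially empty.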

The experimental methods employed to solve the phase problem involve either the addition of heavy atoms to the protein crystals or the use of anomalously scattering atoms in the protein structure (in most cases, heavy atoms). Isomorphous replacement (IR) requires the insertion of heavy atoms into the crystal. The reflections of similar (isomorphous) crystals, with and without the heavy atoms, are compared to obtain the positions of the heavy atoms. From these positions, the initial phases can be calculated. Single- or multiple-wavelength anomalous dispersion (SAD or MAD) takes advantage of the anomalous scattering of atoms in the protein, such as selenium introduced through the use of selenomethionine in the growth media. Selenium is the most commonly used element when SAD or MAD methods are used to obtain initial phase information. With SAD, only one wavelength is used to collect diffraction data, whereas MAD uses up to three different wavelengths.

In addition to the experimental methods used for obtaining initial phase information, there are computational methods such as molecular replacement (MR) and *ab initio* (direct) calculations. The latter require very high resolution (better than about 1 Å), which has only been achieved for a handful of protein crystals. MR, by contrast, uses a solved homologous protein structure from which the initial phases can be obtained, as similar structures will tend to have related phases as long as they are in the same orientation and position in the asymmetric unit. MR tries to place the known structure in the unknown crystal by using rotation and translation functions. First, the rotation function is used to calculate the approximate orientation of the known molecule relative to the unknown one and then the translation function is used to superimpose the two molecules. If more than one molecule is present in the asymmetric unit, further rotation and translation searches are performed until all molecules have been placed.
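To make the geometry concrete, the sketch below shows what a rotation followed by a translation does to a set of atomic coordinates. It is not how MR programs search for a solution (they score many trial orientations and positions against the measured data); it only illustrates the rigid-body operation that a successful search ends up applying. The coordinates, the choice of a rotation about the z axis and the function names are all invented for the example.

```python
import math

def rotate_z(point, angle_deg):
    """Rotate a 3D point about the z axis by angle_deg."""
    t = math.radians(angle_deg)
    x, y, z = point
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t),
            z)

def place_model(coords, angle_deg, shift):
    """Apply the rotation first, then the translation, to every atom."""
    placed = []
    for p in coords:
        x, y, z = rotate_z(p, angle_deg)
        placed.append((x + shift[0], y + shift[1], z + shift[2]))
    return placed

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Made-up "search model" of three atoms (coordinates in Å)
search_model = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
placed = place_model(search_model, 90.0, (10.0, 0.0, 5.0))
```

Because the operation is rigid, all interatomic distances in the model are preserved; only its orientation and position change.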

MR delivers initial phases from the known structure, generating a first electron density map. Since this map is biased towards the solved structure, refinement is necessary to bring the final model closer to the experimental data than the structure used in MR.

In addition to the previously mentioned bias, several amino acids that are not conserved might be missing from the model or they might be “mutated” to glycine or alanine. Manual building can be used where extra density appears. The addition of large aromatic amino acids such as tryptophan, phenylalanine or tyrosine is a good start, as they can act as beacons in the structure. The observation of extra density that fits the amino acid side chains is a hint that MR has worked properly. Following manual building, several cycles of refinement can be applied; the outcome is a new electron density map that contains more information based on the experimental data, so that more amino acids, loops, conformations and ligands can be added to the model.

A restrained refinement process can also be applied, which restrains the model geometry (bond lengths and angles) to typical values from the organic chemistry literature. These restraints effectively add more data, compensating for the lack of measured structure factor amplitudes caused by the poor diffraction of protein crystals. This process of manual building and refinement is repeated until no extra information is added to the model.

The progress of refinement is monitored using the Rwork and Rfree values. Rwork measures how well the refined model fits the data. Rfree is the same statistic calculated for 5% of the data that is omitted from the refinement, and is used to detect over-fitting of the model. The equation for Rwork is:

$$R_{\mathrm{work}} = \frac{\sum_{h} \big|\, |F_o(h)| - |F_c(h)| \,\big|}{\sum_{h} |F_o(h)|},$$

where |Fo(h)| are the observed structure factor amplitudes and |Fc(h)| are the calculated structure factor amplitudes. Initial models often have an Rwork of 40 to 50%, whereas in a refined model Rwork is below 25% in most cases (high resolution structures can have an Rwork below 20%). Rfree is typically a few percent higher. Finally, the refined protein model needs to be validated to verify that parameters such as geometry and torsion angles are within acceptable limits.
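The R-factor described above takes only a few lines to compute. The sketch below is a simplified illustration: it assumes the observed and calculated amplitudes are already on the same scale (real refinement programs apply a scale factor first), and the amplitude values, function names and random split are invented for the example.

```python
import random

def r_factor(f_obs, f_calc):
    """R = sum(| |Fo| - |Fc| |) / sum(|Fo|) over the given reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("observed and calculated lists differ in length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator

def split_work_free(n_reflections, free_fraction=0.05, seed=0):
    """Randomly set aside a fraction of reflection indices as the 'free' set."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_free = max(1, round(n_reflections * free_fraction))
    free = set(indices[:n_free])
    work = [i for i in range(n_reflections) if i not in free]
    return work, sorted(free)

# Made-up amplitudes for four reflections
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 66.0, 39.0]
r_value = r_factor(f_obs, f_calc)  # (10 + 5 + 6 + 1) / 280

work, free = split_work_free(100)
```

Rwork would be `r_factor` evaluated over the `work` reflections, and Rfree the same function evaluated over the `free` reflections, which must never be used during refinement.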

The geometry and stereochemical properties of the final protein model can be assessed using the MolProbity (Davis *et al.*, *Nucleic Acids Res* 35: W375-383) or ProCheck (Laskowski *et al.*, *J Mol Biol* 231(4): 1049-1067) servers. The main-chain torsion angles (φ, ψ) can be evaluated using Ramachandran plots (Ramachandran *et al.*, *J Mol Biol* 7: 95-99) to verify that the dihedral angles fall within the allowed regions.
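A torsion angle such as φ or ψ is the dihedral angle defined by four consecutive atoms: C(i-1), N(i), Cα(i) and C(i) for φ, and N(i), Cα(i), C(i) and N(i+1) for ψ. The sketch below computes a dihedral angle from four points using a standard vector formula, with the sign convention in which a cis arrangement gives 0° and a trans arrangement gives 180°. The function names are hypothetical and the test points are simple made-up coordinates, not real atoms.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (-180 to 180) defined by four points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up points: the central bond runs along z in each case
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Computing φ and ψ in this way for every residue, and plotting one against the other, gives the points of a Ramachandran plot.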

Following structure validation, we are now in a position to deposit the structure model, along with the observed structure factor amplitudes, in the Protein Data Bank (PDB), where a final validation is performed by the curators.

Last Changed: 10.11.2012