What is a tiling array?
For the Coif1 wavelet, the low-pass filter coefficients f and high-pass filter coefficients g are given in [13]. The pyramidal structure of the algorithm makes signal decomposition (1)-(2) and signal reconstruction (3) computationally very efficient [11]. Additional file 1, Figure S1 shows the high-pass and low-pass filter coefficients for the decomposition and reconstruction procedures. The probes are spaced every 38 bases on average, so that neighboring probes overlap.
No probes were included for interspersed repetitive DNA, so there are gaps in the genome tiling paths on the array. These datasets are an interesting test for our algorithm because they have broad peaks. In the current work we propose a new computational approach to analyzing ChIP-chip data using wavelet decomposition. We use Coiflets (Coif1) as basis functions for the wavelet decomposition [11], as the Coiflet mother wavelet has a nearly symmetrical, peak-like form.
This shape resembles the tiling array signal profile at the transcription factor binding sites observed in ChIP-chip experiments (see the top graph in Figure 2). We chose Coiflets for the wavelet decomposition because they combine the maximum number of zero moments with a small width (also called support in the wavelet literature), ensuring a fast convergence rate [11].
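As a rough illustration of the pyramidal scheme mentioned earlier, the sketch below computes approximation coefficients level by level: each step low-pass filters the previous level and keeps every second sample. The filter values are the commonly tabulated Coif1 scaling coefficients rather than values taken from this text, and the periodic boundary handling, the filter alignment and the function names `approximate` and `approximation_levels` are assumptions made for the example.

```python
import math

# Coif1 low-pass (scaling) filter as commonly tabulated, e.g. in PyWavelets.
# Assumption: these are not copied from the text, which does not list them.
COIF1_LOW = [
    -0.01565572813546454,
    -0.0727326195128539,
    0.38486484686420286,
    0.8525720202122554,
    0.3378976624578092,
    -0.0727326195128539,
]


def approximate(signal):
    """One pyramid step: low-pass filter, then keep every second sample."""
    n = len(signal)
    out = []
    for start in range(0, n, 2):
        acc = 0.0
        for k, coeff in enumerate(COIF1_LOW):
            # Periodic wrap-around at the edges is an assumption of this sketch.
            acc += coeff * signal[(start + k) % n]
        out.append(acc)
    return out


def approximation_levels(signal, max_level):
    """Return the approximation coefficients for levels 1..max_level."""
    levels = []
    current = list(signal)
    for _ in range(max_level):
        current = approximate(current)
        levels.append(current)
    return levels
```

Because each level halves the number of samples, the total work across all levels stays proportional to the input length, which is where the efficiency of the pyramidal structure comes from.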
Input signal and wavelet approximation coefficients of the input signal. Top graph: an example of part of the input signal for the wavelet decomposition algorithm. Both red and green channels are shown. The signal is assumed to be zero for the missing data gaps in the genomic regions without any probes on the tiling array.
We applied a thresholding procedure to the wavelet coefficients at different resolutions in order to delineate the regions of biochemical activity of interest at the same confidence level for all relevant length-scales.
The input signal for the wavelet decomposition is derived as follows. For each probe, we define the signal as a function of the genomic coordinate to be equal to the measured intensity of that probe over the non-overlapping part of the probe, as well as over half of each part that overlaps with a nearest-neighbor probe along the genomic coordinate. Each overlapping region is thus divided equally between the two nearest-neighbor probes. The signal is assumed to be zero for the missing data gaps in the genomic regions lacking probes on the tiling array.
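The construction just described can be sketched as follows. This is a minimal example, assuming probes are given as hypothetical `(start, end, intensity)` tuples with an exclusive end, sorted by start, and that each probe overlaps only its immediate neighbors; the function name `build_signal` is illustrative.

```python
def build_signal(probes, length):
    """Piecewise-constant signal over genomic coordinates 0..length-1.

    probes: list of (start, end, intensity), end exclusive, sorted by start.
    Overlaps between neighboring probes are split at their midpoint;
    positions not covered by any probe stay at zero.
    """
    signal = [0.0] * length
    for i, (start, end, intensity) in enumerate(probes):
        left, right = start, end
        if i > 0:
            prev_end = probes[i - 1][1]
            if prev_end > start:
                # Give the first half of the overlap to the previous probe.
                left = start + (prev_end - start) // 2
        if i + 1 < len(probes):
            next_start = probes[i + 1][0]
            if next_start < end:
                # Give the second half of the overlap to the next probe.
                right = next_start + (end - next_start) // 2
        for pos in range(max(left, 0), min(right, length)):
            signal[pos] = intensity
    return signal


# Two overlapping probes followed by a gap and a third probe.
example = build_signal([(0, 10, 1.0), (6, 16, 2.0), (30, 40, 3.0)], 45)
```

In the example, the overlap between the first two probes covers coordinates 6 to 9, so the boundary between their intensities falls at coordinate 8, and the untiled stretch from 16 to 29 is zero.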
An example of the input signal for the wavelet decomposition algorithm is shown in Figure 2 (top graph). In this figure, the signal of biochemical activity (a Pol II binding site in this case) contained in the red channel is located in two separate genomic intervals. As a result, it is possible to log-transform them.
We filter out those coefficients while retaining the rest, including those corresponding to peaks. The level of wavelet decomposition to be used is defined by the typical length-scale of the signal variation we wish to analyze. This length-scale (the size of the peaks, in base pairs) falls in one range in the ChIP-chip experiments for Pol II data and in another for the histone modification data.
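One way to turn a target peak size into a decomposition level is sketched below. It assumes that the wavelet width at level m is roughly 2^m samples, and it picks the level whose width is closest to the peak size on a logarithmic scale. The function name `decomposition_level`, the `sample_spacing` parameter and the example sizes are hypothetical, not values from the experiments.

```python
import math


def decomposition_level(peak_size, sample_spacing=1):
    """Level m whose wavelet width (about 2**m samples) best matches the peak.

    peak_size and sample_spacing are in base pairs; the result is at least 1.
    """
    samples = peak_size / sample_spacing
    return max(1, round(math.log2(samples)))


# A peak of about 1000 bp sampled at every base maps to level 10 (2**10 = 1024).
level = decomposition_level(1000)
```

Broader peaks map to higher levels, since doubling the peak size raises the matching level by one.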
The width of the wavelet at the decomposition level m is approximately 2^m. The presence of biochemical activity is indicated by the enrichment of the red channel signal relative to the green channel signal. If there is no enrichment of the signal in the red channel relative to the signal in the green channel, we expect the wavelet coefficients for the red and green channels to grow proportionally to each other as a function of the average intensity.
Wavelet coefficients corresponding to regions where the red signal is enriched relative to the green signal will deviate from this main trend. Broader peaks require higher-order wavelet decomposition.
Wavelet coefficients A_m^red vs. A_m^green. Each point on the graph corresponds to a pair of wavelet coefficients of the red and green channel signals on the tiling array.
As can be seen from the graph, the majority of the points are located inside the triangle-like area bounded by two lines coming from the origin. Subplots C, F, I and L of Figure 3 show the histograms of the distribution function for log[A_m^green]. Many data points inside the red box in subplots B, E, H and K of Figure 3 correspond to peaks inside contiguously tiled regions whose size is larger than the size of the wavelet used for the signal decomposition.
We can only identify parts of the broad peaks by going back to the original input signal and selecting the regions corresponding to those wavelet coefficients.
The deviation from the normal distribution is due to the regions of high enrichment attributable to specific hybridization. For every wavelet coefficient above the threshold, we can go back to the original signal and identify the region of biochemical activity. The size of each region is related to the resolution level of the corresponding wavelet.
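The thresholding of log-ratios can be sketched as follows. This is a minimal example under stated assumptions: the log-ratio is taken as log(A_red / A_green), the threshold is a multiple of the standard deviation above the median, and pairs with a non-positive coefficient are skipped. The names `log_ratios`, `threshold_hits` and `n_sigma` are illustrative, and the exact threshold rule is not specified in the text.

```python
import math
import statistics


def log_ratios(red, green):
    """log(red / green) per coefficient pair, or None if either is not positive."""
    return [
        math.log(r / g) if r > 0 and g > 0 else None
        for r, g in zip(red, green)
    ]


def threshold_hits(red, green, n_sigma=2.0):
    """Indices whose log-ratio exceeds median + n_sigma * standard deviation."""
    ratios = log_ratios(red, green)
    valid = [x for x in ratios if x is not None]
    center = statistics.median(valid)
    spread = statistics.pstdev(valid)
    cutoff = center + n_sigma * spread
    return [i for i, x in enumerate(ratios) if x is not None and x > cutoff]


# Nine background pairs with equal intensities and one enriched pair.
hits = threshold_hits([1.0] * 9 + [20.0], [1.0] * 10)
```

Working with the ratio on a log scale puts all resolution levels on the same footing, so one choice of `n_sigma` corresponds to the same confidence level at every level.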
At the end, all the detected regions are combined. If the data are normal, the plot will be linear. The log-normal distribution is a characteristic feature of multiplicative random processes [16]. One explanation for the appearance of the log-normal distribution in the data is that the measurement process of the fluorescent signal of the tiling array involves multiplicative random factors. These factors can include the collection efficiency of the light during array scanning and the variation of the quantum efficiency of the pixels in the CCD camera.
The log-normal distribution was previously observed in the fluorescence microscopy signal [17]. Furthermore, the log-normal distribution of the data could be attributed to the kinetics of the hybridization process on the array. A very interesting feature can be observed in Figure 5: approximation coefficients for the red channel are consistently above the approximation coefficients for the green channel over the region of biochemical activity across several wavelet decomposition levels.
We use this characteristic to decrease the number of false-positive calls. We describe the numerical procedure ensuring the consistency property of the wavelet coefficients below.
Illustration of the consistency property for the wavelet coefficients corresponding to a region of biochemical activity. We used the same genomic region as in Figure 2. The red box indicates the hit regions. According to 2.
Repeating the same argument for the decomposition level m, we find nine approximation coefficients at the resolution level m-1 contributing mostly to the numerical values of three approximation coefficients at the resolution level m. Our algorithm performs wavelet decomposition of the signals in the red and green channels, computes the standard deviations of the distribution functions of the log-ratios of the wavelet coefficients, thresholds the log-ratios, checks the thresholded wavelet coefficients for consistency, generates hit regions from the wavelet coefficients selected by the algorithm, and estimates the false positive rate (FPR) for the chosen threshold.
The width of the wavelet at the decomposition level m is approximately 2^m. We call a region corresponding to the approximation coefficient A_{m,n} a hit only if (a) the log-ratio of the red and green coefficients for A_{m,n} passes the threshold, and (b) the same is true for the log-ratios of at least three approximation coefficients at the resolution levels m-1 and m-2 contributing greatly to A_{m,n}. The thresholding allows us to select peaks of different sizes with the same confidence level.
Requirements (a) and (b) impose consistency constraints on the wavelet coefficients across three resolution levels, which helps to reduce the number of false positives. Each time we find a hit, we go from A_{m,n} to A_{m-1,2n-1}, A_{m-1,2n-2} and A_{m-1,2n-3}, and so on down the levels until we reach the original input signal, to identify the region of biochemical activity.
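The consistency check across three resolution levels can be sketched as below. It uses the index mapping given above (coefficient n at level m is traced back to indices 2n-1, 2n-2 and 2n-3 at level m-1) and takes one possible reading of the requirement: at least three contributing coefficients must pass the threshold at level m-1 and at least three at level m-2. The data layout (`above` as a dictionary of index sets per level) and the names `contributors` and `is_consistent_hit` are assumptions for illustration.

```python
def contributors(n):
    """Indices at the next finer level that contribute most to coefficient n."""
    return [2 * n - 1, 2 * n - 2, 2 * n - 3]


def is_consistent_hit(above, m, n, min_support=3):
    """Check a thresholded coefficient for support at levels m-1 and m-2.

    above: dict mapping level -> set of indices whose log-ratio passed
    the threshold at that level.
    """
    if n not in above.get(m, set()):
        return False
    level1 = contributors(n)
    # Distinct indices two levels down; neighboring triples share indices.
    level2 = {j for i in level1 for j in contributors(i)}
    support1 = sum(1 for i in level1 if i in above.get(m - 1, set()))
    support2 = sum(1 for j in level2 if j in above.get(m - 2, set()))
    return support1 >= min_support and support2 >= min_support


# Coefficient 5 at level 3, supported at levels 2 and 1.
passed = {3: {5}, 2: {7, 8, 9}, 1: {11, 12, 13}}
consistent = is_consistent_hit(passed, 3, 5)
```

A coefficient that passes the threshold at one level only, for instance because of an isolated noisy probe, fails this check, which is how the constraint reduces false positives.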
We combine each group of overlapping hits into one larger hit region. We call N the total number of final hit regions. The signal of the red channel is randomly shuffled between the probes. Figure 5 shows Pol II binding sites located in two separate genomic intervals. The wavelet coefficients that satisfy conditions (3), and the regions of the signal corresponding to those wavelet coefficients, are indicated by the red boxes.
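Combining overlapping hits into final hit regions, as described at the start of the paragraph above, is a standard interval merge. The sketch below assumes hits are `(start, end)` coordinate pairs and treats touching intervals as overlapping; the function name `merge_hits` is illustrative.

```python
def merge_hits(hits):
    """Merge overlapping (start, end) hits into non-overlapping hit regions."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


regions = merge_hits([(10, 20), (15, 30), (40, 50), (1, 5)])
n_regions = len(regions)  # the total number of final hit regions, N
```

Hits coming from different resolution levels can be pooled into one list before merging, so a broad region found at a coarse level absorbs the narrower hits inside it.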
Raw signals for the green and red channels are shown as blue and pink tracks. Hit regions obtained by combining the information from these resolution levels (that is, by combining overlapping yellow bars) are shown as red bars (step 4 of our algorithm).
IGB snapshot of the signal on the tiling array and hit regions. Hit regions obtained by combining the information from these resolution levels are shown as red bars.
In order to test the performance of our method with the consistency constraint described above, we applied our algorithm to the Spike-in data from the ENCODE Nimblegen tiling array.
Mixtures of human genomic DNA and human sequences at various concentrations were hybridized on the array [5].
Spike-in data were obtained from human sequences of approximately the same size, which were generated in the laboratory. Spike-in data allow for an objective estimation of the performance of our method and a comparison with other methods. Our choice of the model parameters was based on the observation that the size of the wavelets should be comparable to the size of the peaks to be identified. ROC-type curves were generated by plotting the sensitivity. The optimal ROC-type curve is the one closest to the upper left corner.
A sliding window of fixed width (in bp) was used, and windows with high median values were identified as hits. The Splitter algorithm [7] incrementally changes the cutoff value of the signal and compares the total number of hits before and after the change. If the ratio of the number of hits before and after the cutoff change is smaller than a pre-defined value, the algorithm stops and the hits found before the last cutoff change are reported as final hits.
Clusters of probes located closer together than a "maxgap" parameter were merged. Clusters containing fewer probes than a "minrun" parameter were discarded. Permu [6] identifies the peaks within the sliding window based on an iterative thresholding procedure. A false positive rate (FPR) is assigned to each peak using the randomized data. Our method demonstrated excellent performance compared to other methods, as can be seen from the ROC-type curve in Figure 7.
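The "maxgap" and "minrun" post-processing described for Splitter at the start of the paragraph above can be sketched as follows. This is an illustration of the two rules as stated in the text, not Splitter's actual code: probes above the cutoff are given by hypothetical genomic positions, neighbors closer than `maxgap` are joined into one cluster, and clusters with fewer than `minrun` probes are dropped.

```python
def cluster_probes(positions, maxgap, minrun):
    """Group probe positions into clusters and filter out short clusters.

    Returns (first_position, last_position) for each cluster kept.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] >= maxgap:
            # Gap is not smaller than maxgap: close the current cluster.
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [(c[0], c[-1]) for c in clusters if len(c) >= minrun]


kept = cluster_probes([100, 138, 176, 500, 538, 2000], maxgap=100, minrun=2)
```

In the example, the isolated probe at position 2000 forms a cluster of one and is discarded, while the two groups of closely spaced probes are kept.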
The intuitive reason for such good performance is that the shape of the wavelets we use is very similar to the shape of the peaks in the signal. Another reason is the use of the consistency constraint, which reduces the number of false positives.
ROC-like curves generated from the Spike-in experiment data. Our method demonstrates excellent performance compared to other methods, as shown by the ROC-type curves.
We analyzed tiling array data using wavelet transformations, and from the resulting wavelet coefficients we obtained clear intensity and length-scale separation between the background signal and the signal coming from the regions of biochemical activity.
A thresholding procedure was applied to the wavelet coefficients at different resolution levels with the consistency constraint in order to delineate the regions of biochemical activity of interest at the same confidence level for all the relevant length-scales. This method was successfully applied to several ChIP-chip data sets, including Pol II and histone modification experiments.
Our method demonstrated excellent performance using Spike-in data from the Nimblegen tiling array.
References: Curr Opin Plant Biol, 10(5); Chromosome Res, 13(3); Genomics, 83(3); Quackenbush J: Microarray data normalization and transformation. Nat Genet; Genome Res, 18(3); Embo Rep, 8(8); Genome Biol; Biometrics, 63(3); Nature Biotechnology, 25(2); IEEE T Pattern Anal, 11(7); Nature; Genome Res, 18(8).
These regions are masked using "Repeat Masker". All three output options (X, N and lower case) are recognized by Array Designer.
Array Designer designs evenly tiled probes across the genome which facilitate genome wide analysis of many important biological functions including sites of chromatin modification and sites of DNA methylation. Genome-wide tiling arrays can overcome many of the limitations of the previous approaches by comprehensively probing transcription in all regions of the genome.
Whole-genome high-density tiling arrays provide a universal platform for genomic analysis through whole genome amplification. Array Designer designs thousands of primers and probes for oligo and cDNA microarrays in seconds. It designs probes for SNP detection, microarray gene expression and gene expression profiling.
In addition, comprehensive support for tiling arrays and resequencing arrays is available. Any significant homologies identified are automatically avoided while designing oligos.
Repeat regions are identified and automatically avoided.