U. S. Flight Track Database: Data Summary

Data Qualification, Slicing, and Validation

The data files consist of raw downloaded data that have been quality checked and sorted. The quality checking is essentially the elimination of corrupt, excess, redundant, or inadequate data. Corrupt data take the form of an occasional isolated en route report that does not follow the correct format. The frequency of corrupt data may approach as much as one per one hundred thousand. Corrupt data are detected by automated scanning of the data file and eliminated only after human examination. Some records detected by the scanning procedure are only run-on lines (no line feed character) that can easily be repaired. Excess data are those that fall outside the analysis volume defined by latitude, longitude, and altitude limits. Redundant data are those reports that exactly duplicate others. Excess and redundant data are eliminated by a procedure delineated here. Inadequate data are single en route reports, orphans if you will, that are apparently unrelated to any others. Since you can't make a line with just one point, inadequate data are eliminated.
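
The elimination of excess, redundant, and inadequate data described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the Report fields, the analysis volume limits, and the idea that reports are related to one another through a shared flight identifier are all hypothetical, since the record layout and limits are not given on this page.

```python
from collections import Counter
from typing import NamedTuple


class Report(NamedTuple):
    """Hypothetical en route report; the real record layout is not shown here."""
    flight_id: str
    time: int    # seconds from an arbitrary epoch
    lat: float   # degrees
    lon: float   # degrees
    alt: float   # feet


# Hypothetical analysis volume limits (illustrative values only).
LAT_RANGE = (20.0, 50.0)
LON_RANGE = (-130.0, -60.0)
ALT_RANGE = (0.0, 60000.0)


def in_volume(r: Report) -> bool:
    """Return True if the report lies inside the analysis volume."""
    return (LAT_RANGE[0] <= r.lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= r.lon <= LON_RANGE[1]
            and ALT_RANGE[0] <= r.alt <= ALT_RANGE[1])


def qualify(reports: list[Report]) -> list[Report]:
    """Drop excess, redundant, and inadequate (orphan) reports, in that order."""
    # Excess: reports outside the analysis volume.
    kept = [r for r in reports if in_volume(r)]
    # Redundant: exact duplicates; dict.fromkeys keeps the first copy, in order.
    kept = list(dict.fromkeys(kept))
    # Inadequate: a flight left with a single report cannot form a line.
    counts = Counter(r.flight_id for r in kept)
    return [r for r in kept if counts[r.flight_id] > 1]
```

Orphans are removed last in this sketch because a report can become an orphan only after its companions have been dropped as excess or redundant.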

The qualified data are sliced into pieces by a procedure delineated here, and a series of files are produced with descriptions provided here.

Validation takes place after processed data files have been created and saved. No changes are made to those processed data files on the basis of the validation procedure, delineated here. There is only a monthly validation summary file with a name in the form tmoYYYYMM.val that has the following format. The only way to tell which hours of data are considered valid is to examine the columns identifying valid flight segments in the appropriate tmoYYYYMM.val file. Since estimated data validity is intended to help identify data useful for further analysis, validation results are summarized below.
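
As a small convenience, the validation file name for a given month can be built from the tmoYYYYMM.val pattern. The helper below is a hypothetical sketch that assumes YYYY is the four-digit year and MM is the two-digit, zero-padded month; it says nothing about the contents of the file.

```python
def validation_filename(year: int, month: int) -> str:
    """Build a monthly validation summary file name of the form tmoYYYYMM.val."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumed reading of the pattern: 4-digit year, 2-digit zero-padded month.
    return f"tmo{year:04d}{month:02d}.val"
```

Under these assumptions, validation_filename(2004, 9) returns "tmo200409.val".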

The following links give a limited statistical summary of the data available on the FTP server.

The hourly summary table has a separate row for each month and a number of columns for various monthly statistics. The third column shows the total hours in the month. The next shows the number of hours in which there are no processed data. The next shows the number of hours for which there are some processed data in the data files. The following columns show the results of the validation procedure. The sixth column shows the number of hours considered incomplete (or partial) and therefore not valid as complete or representative of a full hour. The next shows the number of hours considered complete (or full) and therefore valid or useful for further analysis. The next shows the valid hours as a percentage of the total hours in the month. The last two columns show the percentages of data (non-empty) hours that were identified as invalid (rejected) and valid (remaining).
The segment summary table has a separate row for each month and a number of columns for various monthly statistics. The third column shows the total number of data segments in the month. The next shows the percentage of segments that were rejected because they were in hours considered incomplete (or partial) and therefore not valid as complete or representative of a full hour. The next shows the percentage of segments remaining in hours considered complete (or full) and therefore valid or useful for further analysis. The next shows the number of segments remaining. The next four columns give the same information about the length of all data segments in the month.
The daily summary table has a separate row for each month and a number of columns for various monthly statistics. The third column shows the total days in the month. The next shows the number of days in which there are no valid hourly data. The next shows the number of days for which there are some valid hourly data. The next shows the number of days for which there is a full day of valid hourly data. The next shows the full days of valid data as a percentage of the total days in the month. The next two columns show the average daily track length and the average daily number of flights. These numbers were only computed for months in which there was at least one of each day of the week (i.e., at least one Monday, at least one Tuesday, ...) with a full 24 hours of valid data.
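
The percentage columns of the hourly table and the day-of-week condition of the daily table can be reproduced with a short Python sketch. The function and key names are hypothetical, and the sketch assumes that the empty and non-empty hours together make up the month and that every non-empty hour is classified as either valid or invalid, which is how the column descriptions above read.

```python
from datetime import date


def hourly_percentages(total_hours: int, empty_hours: int,
                       valid_hours: int) -> dict[str, float]:
    """Compute the percentage columns of the hourly summary for one month."""
    data_hours = total_hours - empty_hours    # hours with some processed data
    invalid_hours = data_hours - valid_hours  # incomplete (partial) hours
    return {
        "valid_pct_of_total": 100.0 * valid_hours / total_hours,
        "rejected_pct_of_data": 100.0 * invalid_hours / data_hours if data_hours else 0.0,
        "remaining_pct_of_data": 100.0 * valid_hours / data_hours if data_hours else 0.0,
    }


def covers_every_weekday(full_days: list[date]) -> bool:
    """True if the days with a full 24 valid hours include each day of the week."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return {d.weekday() for d in full_days} == set(range(7))
```

For example, a 720-hour month with 20 empty hours and 630 valid hours has 700 data hours, of which 70 are rejected. The average daily track length and flight count would be reported for a month only when covers_every_weekday is true for its list of fully valid days.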

NOTE: There is no way to determine if the data set is complete or incomplete from information provided only by the data. Data were identified as 'not valid' or 'valid' by a statistical procedure based on the assumption that weeks of data consisting of sequential collections of hourly data from the entire analysis volume could be collected in stationary ensembles that exhibit reasonable behavior. It might be better to think of the process as identifying data as 'probably invalid' or 'probably not invalid'.

If you have questions about this site, you may send email to Don Garber at donald.p.garber@nasa.gov.

This page was last modified on 29 September 2004.