Adding Value to Data: The Power of Data Transformation for Improved Analytical Outcomes

Post a Comment

Accessing abundant data, including metadata, plays a crucial role in accelerating law enforcement investigations and allowing authorities to quickly adapt to emerging situations and threats. However, the value of the insights gained is dependent on the data being properly cleansed, enriched, fused and analyzed. Applying different transformations through ETL (Extract-Transform-Load), ELT (Extract-Load-Transform), and Federated Search is critical to extracting high-quality analytical insights. But what do the transformations look like?  

For example, July 4, 1976 (07/04/1976) is like any other date and can be correlated, referenced, or aggregated with other events or transactions happening that same day. Readers based in the US might recognize that it represents the 200th anniversary of adopting the US Declaration of Independence. However, on an analytical level there is much more to this date (or any date). Performing basic transformations exposes new dimensions that analytical platforms use to identify specific meanings, patterns, and behaviors. 

Known information about July 4, 1976: 

Dimension  Value 
Year  1976 
Month of Year  July 
Day of Month 
Day of Week  Sunday 
Week of Month 
Week of Year  28 
Day of Year  186 
Quarter-Calendar 
Quarter-Financial 
Season  Summer 
Holiday-US-Federal  Yes 
Leap Year  Yes 

Not only is July 4th a significant US national holiday, but other notable historical events also occurred on this day: West Point was officially opened in 1802, the Philippines gained independence from the US and became a sovereign state in 1946, and the American journalist Geraldo Rivera turned 33 years old (80 today). In other countries, dates and anniversaries are equally important and generally defined in simple look-up tables. 

Furthermore, secondary observations can be derived from the data. Being a US national holiday and a Sunday, we know specific types of business establishments, including banks and financial institutions, will be closed. Therefore, a spike in certain activity (e.g., money transfers) on this day may be cause for further investigation. However, in the Middle East, Sunday is the first workday of the week, so different patterns would be expected. 

Generally, transforms are categorized into three groups depending on their complexity: 

Basic – Simple calculations, substitutions, and formatting (REGEX) 

  • Uppercasing all text, trimming spaces, convert types 
  • Formatting dates into YYYY/MM/DD or extracting day-of-week 
  • Replace characters such as _AVENUE_ with _AVE_ 
  • Calculating ratios between values: col_x / col_y = col_z 
  • Trimming US Zip code+4 to a five alphanumeric value 
  • Rounding lat/longs to five decimal places 

Intermediate – Custom algorithms, heuristics, and reference data  

  • Performing social security or national ID validation checks (age/issuance/ITIN/death) 
  • Simple checksum verification algorithms (Luhn / Verhoeff)  
  • Running phonetic reduction/hashing algorithms (Eudex, Soundex, Ngrams) 
  • Creating secure one-way cipher text hash functions (MD5, SHA-256) 
  • Converting Zip or postal codes to lat/longs or military grid reference system (MGRS) 
  • Calculating the distance between two lat/long points (Haversine Formula) 

Advanced: Often utilize pre-trained machine learning models to define outputs 

  • Creating relationships between people identified in a photograph 
  • Defining the make/model, color, and year of a car from a picture  
  • Identifying weapons, drugs, license plates, and other text/logos in pictures 
  • Flagging stressful situations and content from body-worn camera videos 
  • Parsing and normalizing street addresses into a standardized format 
  • Natural Language Processing (NLP) to identify proper entities and relationships 

When loading data, basic cleanups are often run on all the data to trim leading/trailing spaces, remove double spaces, delete special characters (periods, commas, quotes, dashes, etc.), and convert text to uppercase. These transformations are not mutually exclusive, so one transform’s output can be another’s input, and chaining the transforms into a workflow helps ensure consistency. This example shows the transforms applied to an address: 

Input: 1600 Pennsylvania  Avenue, N.W.   Washington, D.C.,   20500-0004 

First Pass: 1600 PENNSYLVANIA AVENUE NW WASHINGTON DC 205000004 

The process continues with standard replacements to help map values to a common denominator. In the example, the address contains the word “AVENUE,” and the United States Postal Service shows variations to include AV, AVE, AVEN, AVENU, AVN, AVNUE. These derivations should always return a defined value, such as AVE. The same goes for other entries such as STREET, PARKWAY, BOULEVARD, ROUTE, etc. Additionally, the +4 for the US Zip code won’t be needed since it does not provide any additional value for geo-position (in this case) and is trimmed to five characters.  

Output: 1600 PENNSYLVANIA AVE NW WASHINGTON DC 20500 

The more standardized the data values, the more consistent the results. Advanced transformations like LibPostal generate standard address outputs from dirty or erroneous input data. This example is relatively simple, but an address in New York City (20 W 34th St. #123, New York, NY 10001) may cause some parsing inconsistencies. Addresses in other countries like Japan (150-2345 Tokyo-to, Shibuya-ku, Hommachi 2 choume, 4-7, Sunny Mansion 203) might prove too complicated for basic processing scripts not designed for this format. 

The results returned from a geo-position transform don’t change the address but are additive to the overall entity details; new attributes are created as placeholders to put this derived data, as shown below. Often, these transformations are designed (containerized) for a particular purpose and are embeddable into the data fusion or decision intelligence system. 

Lat = 38.89767  

Long = -77.03655 

MGRS = 18SUJ23400740 

Transformations improve data quality, provide additional dimensions to evaluate, and help standardize the content into consistent and reliable values. All these changes are designed to improve the results generated by an analytical or decision intelligence system. However, a critical piece is still required to tie it all together by merging the different entities created. Our upcoming blog in this series will focus on best practices for performing entity resolution. 

Visit NEXYTE.AI to learn more about NEXYTE, the data fusion and machine learning platform revolutionizing decision intelligence. 

The post Adding Value to Data: The Power of Data Transformation for Improved Analytical Outcomes appeared first on Cognyte.

Article Link: Data Transformation & ETL | Cognyte

1 post - 1 participant

Read full topic



Malware Analysis, News and Indicators - Latest topics
Sp123
"The real threat is actually not when the computer begins to think like a human, but when humans begin to think like computers."

Post a Comment