A large Federal Department is responsible for providing identity and eligibility verification. The verification process is complex: each search must scan databases at various agencies. Searches may yield conflicting information, requiring case workers to manually determine which records are correct. This arduous process of reconciling mismatched records is labor- and time-intensive, and costly to the American taxpayer.
The primary goal was to reduce the number of cases that go to manual verification, so ICF focused on increasing the number of cases that can be resolved automatically. This required creating “golden records” for every person in the Federal Identity System. Working closely with the client using Agile methodology, ICF developed a solution that uses Apache Sqoop to pull data from the disparate sources and store it in Amazon Simple Storage Service (Amazon S3). Algorithms created by our team then process these data to remove conflicting information and build the Person Centric System (PCS), which stores the “golden records.” Amazon Elastic MapReduce (EMR) computing clusters make possible the billions of calculations needed to analyze and organize these data. The final data are exposed as a series of microservices hosted on Amazon EC2 Container Service (ECS) clusters and made available to the appropriate front-end systems, with Apache Kafka handling data requests. All of this is hosted in a virtual private cloud on Amazon Web Services (AWS), removing the burden of new infrastructure for the Federal Identity System. The PCS now serves as the single source of truth for eligibility data across the Department.
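To make the idea of a “golden record” concrete, the following is a minimal Python sketch of one possible conflict-resolution rule: for each field, take the value reported by the most sources, and break ties with the most recently updated record. The rule, the field names, and the agency labels are all illustrative assumptions; the source does not describe the actual algorithms ICF used.

```python
from collections import Counter
from datetime import date

# Hypothetical source records for one person; fields and agency names are illustrative only.
records = [
    {"source": "agency_a", "updated": date(2015, 3, 1), "last_name": "Smith", "dob": "1980-02-14"},
    {"source": "agency_b", "updated": date(2016, 7, 9), "last_name": "Smith", "dob": "1980-02-14"},
    {"source": "agency_c", "updated": date(2017, 1, 20), "last_name": "Smyth", "dob": "1980-02-14"},
]


def build_golden_record(records, fields):
    """Resolve each field by majority vote; break ties with the most recently updated value."""
    golden = {}
    for field in fields:
        # Ignore records that have no value for this field.
        candidates = [r for r in records if r.get(field)]
        if not candidates:
            golden[field] = None
            continue
        counts = Counter(r[field] for r in candidates)
        top = max(counts.values())
        tied = {value for value, count in counts.items() if count == top}
        # Among the most common values, prefer the one seen in the newest record.
        newest = max((r for r in candidates if r[field] in tied), key=lambda r: r["updated"])
        golden[field] = newest[field]
    return golden


golden = build_golden_record(records, ["last_name", "dob"])
print(golden)
```

In this sketch two of three sources agree on the last name, so the majority value is kept. A real system would likely weight sources by reliability rather than treating them equally.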
To maximize the golden record’s accuracy, ICF cleans and validates all historical and incoming source data, currently more than 600 million records and growing. To do this, our data scientists have written and maintain a series of machine learning algorithms in SAS, R, and Python that automatically identify and merge duplicate records and reconcile conflicting information. Previously, this manual process required over 1,000 Oracle stored procedures, driving significant complexity and maintenance costs. Under the new model, ICF’s machine learning algorithms self-optimize over time, requiring only periodic tuning by our analysts.
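As a rough illustration of duplicate identification, the sketch below groups records that probably describe the same person. It is a simple rule-based stand-in, not the machine learning approach described above: it compares normalized names with string similarity from Python's standard `difflib` module and requires an exact date-of-birth match. The field names, the 0.85 threshold, and the sample data are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase and strip whitespace so trivial formatting differences do not block a match."""
    return {key: str(value).strip().lower() for key, value in record.items()}


def similarity(a, b):
    """Average name similarity in [0, 1]; records with different birth dates never match."""
    if a["dob"] != b["dob"]:
        return 0.0
    fields = ("first_name", "last_name")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)


def cluster_duplicates(records, threshold=0.85):
    """Group record indices whose pairwise similarity meets the threshold (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    cleaned = [normalize(r) for r in records]
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            if similarity(cleaned[i], cleaned[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# Hypothetical sample data: the first two rows differ only in spelling and formatting.
people = [
    {"first_name": "Jon", "last_name": "Smith", "dob": "1980-02-14"},
    {"first_name": "John ", "last_name": "SMITH", "dob": "1980-02-14"},
    {"first_name": "Maria", "last_name": "Lopez", "dob": "1975-11-02"},
]
clusters = cluster_duplicates(people)
print(clusters)
```

Comparing every pair of records is quadratic and would not be practical at hundreds of millions of records. At that scale, candidate pairs are usually narrowed first (for example, by grouping on date of birth) before any detailed comparison.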
Early Results and Next Steps:
Looking forward, the team has a pipeline of improvements planned to provide the client further value. While legacy systems currently require nightly batch updates with the latest information, ICF plans to replace these with a Kafka-based stream processing interface. This will not only ensure near real-time updates but also allow ICF to push corrections and data reconciliation back to source systems using two-way streaming, promoting interoperability and coordination among agencies.
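The two-way streaming idea can be sketched without any Kafka dependency. In the Python example below, an ordinary list stands in for an incoming event stream and a generator stands in for the outgoing correction stream. The event shape, the `verified` flag, and the rule for when to send a correction are all hypothetical; the source does not specify how the planned interface would decide these things.

```python
def apply_stream(events, golden_store):
    """Apply change events one at a time and yield corrections for disagreeing sources.

    Each event is a dict with hypothetical keys:
    person_id, source, field, value, verified.
    """
    for event in events:
        record = golden_store.setdefault(event["person_id"], {})
        current = record.get(event["field"])
        if event["verified"] or current is None:
            # Verified or first-seen values update the golden record immediately.
            record[event["field"]] = event["value"]
        elif current != event["value"]:
            # An unverified value conflicts with the golden record:
            # send the golden value back to the originating source.
            yield {
                "to": event["source"],
                "person_id": event["person_id"],
                "field": event["field"],
                "value": current,
            }


# Hypothetical events; in a real deployment these would arrive from a message stream.
events = [
    {"person_id": 1, "source": "agency_a", "field": "last_name", "value": "Smith", "verified": True},
    {"person_id": 1, "source": "agency_b", "field": "last_name", "value": "Smyth", "verified": False},
]
store = {}
corrections = list(apply_stream(events, store))
print(store)
print(corrections)
```

Because each event is handled as it arrives, the golden record is current after every message rather than after a nightly batch, and the disagreeing source receives a correction in the same pass.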
Evaluating Other Uses:
Would this approach help your clients? Consider the following:
- Do you have to reconcile data within one or more datasets that often produce conflicting results?
- Does your client need to synchronize large volumes of information across various databases?
- Is your client struggling with a growing and increasingly complex set of algorithms or procedures that require deep expertise to maintain?
For more information, contact:
Kevin Wright, Principal
Kyle Tuberson, Vice President