
Trouble transforming US Census API data into Avro

asked 2019-09-03 14:56:21 -0500 by avdsa, updated 2019-09-10 12:50:58 -0500

Hello,

I have a pipeline that ingests JSON formatted data from the US Census API. Here is the API call I am making for reference: https://api.census.gov/data/2018/pep/population?get=GEONAME,POP,DENSITY&for=place:*&in=state:39&in=county:*&key=4dc4a831c4ab825b649451a82890fc68bc7fe976

I am using an HTTP Client origin configured to ingest a JSON array of objects: [screenshot]

My goal is to use a Schema Generator to create an Avro-formatted dataset to be ingested by a Kafka Producer, as seen below:

[screenshot]

I have ingested JSON data before, but the results were returned in LIST-MAP format, so there was no issue with my pipeline configuration. Now my data is returning in LIST format, and I am unsure what steps are needed in order to proceed.

[screenshot]

I am not sure if it is helpful, but the Census website states that their JSON format is nonstandard: "The Census uses a nonstandard version of JSON that is streamlined:"

  • Data are represented in a two-dimensional array
  • Square brackets [ ] hold arrays
  • Values are separated by a , (comma).
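For reference, a response from this endpoint looks roughly like the following (the rows below are invented, but the shape matches the description above: row 0 holds the column names, later rows hold the values):

```python
# Invented sample mirroring the Census "streamlined" two-dimensional array
sample = [
    ["GEONAME", "POP", "DENSITY", "state", "place"],
    ["Example city, Ohio", "197846", "3217.8", "39", "01000"],
    ["Another city, Ohio", "140640", "2521.9", "39", "21000"],
]
```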

Any help is very much appreciated. Thank you.

09/10/19 UPDATE:

I have finally mapped field values as field names. Here is the code I have so far.

for record in records:
  outmap = {}
  # Use each value as both the field name and the field value
  for value in record.value:
    outmap[value] = value
  record.value = outmap
  output.write(record)

I am now working on applying the field names from Record 1 to the remaining records. I cannot find any documentation on referencing a specific record number within an array of objects. Any guidance is much appreciated. Thank you.


1 Answer


answered 2019-09-03 17:29:02 -0500 by metadaddy

The problem here is that their JSON representation carries the field names as a list in the first record, with the field values as lists in subsequent records - this is like a mashup of JSON and CSV. The only way I can think of to handle this is to write an evaluator script (Groovy, JavaScript, or Jython) to read in that first record without emitting an output record, then apply the field names to the remaining records.
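The core of that approach can be sketched in plain Python (the name `rows_to_records` is mine, not a StreamSets API; inside a Jython Evaluator you would save the header from the first record, then assign `dict(zip(header, record.value))` back to `record.value` and call `output.write(record)` for each remaining record):

```python
def rows_to_records(rows):
    """Convert a Census-style two-dimensional array, whose first row
    holds the column names, into a list of dicts keyed by those names."""
    header = rows[0]
    # Zip the header against each subsequent row of values
    return [dict(zip(header, row)) for row in rows[1:]]

# Invented example rows in the Census shape:
rows = [
    ["GEONAME", "POP", "state"],
    ["Example city, Ohio", "197846", "39"],
    ["Another city, Ohio", "140640", "39"],
]
print(rows_to_records(rows)[0]["POP"])  # prints 197846
```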


Comments

Thanks for the direction, metadaddy. Would you happen to have any templates related to your solution? I am new to Jython and spent most of the day attempting to script a solution, but I can't seem to pull the values from the first record and map them as field names onto the remaining records.

avdsa (2019-09-04 14:04:41 -0500)

Update from my end, so far I have an almost working Jython Script for this issue. It works fine when I explicitly call what index I am mapping the field names. However, when I try to initiate the process through a nested for loop to go through every index, I keep getting this error...

avdsa (2019-09-09 13:59:26 -0500)

SCRIPTING_04 - Script sent record to error: write(): 1st arg can't be coerced to com.streamsets.pipeline.stage.processor.scripting.ScriptRecord

avdsa (2019-09-09 14:00:16 -0500)

And the script I have is here:

  for record in records:
    try:
      keys = record.value[0]
      index = 1
      for values in record.value:
        mapped = dict(zip(keys, record.value[index]))
        index = index + 1
        output.write(mapped)
    except Exception as e:
      error.write(record, str(e))

avdsa (2019-09-09 14:00:32 -0500)
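The SCRIPTING_04 error above occurs because output.write() accepts only the script's record objects, not a bare dict. The fix is to assign the mapped dict back to record.value and write the record itself. A minimal stand-in illustrating this (the Record/Output stubs below are invented for demonstration; in the pipeline these objects are provided by the scripting processor):

```python
class Record(object):
    """Illustrative stub of a scripting-processor record."""
    def __init__(self, value):
        self.value = value

class Output(object):
    """Illustrative stub that, like the real processor, rejects non-records."""
    def __init__(self):
        self.written = []
    def write(self, record):
        if not isinstance(record, Record):
            raise TypeError("write(): 1st arg must be a record")
        self.written.append(record)

output = Output()
record = Record(["Example city, Ohio", "197846"])
keys = ["GEONAME", "POP"]

record.value = dict(zip(keys, record.value))  # map values onto field names
output.write(record)                          # write the record, not the dict
print(record.value["GEONAME"])  # prints Example city, Ohio
```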

Please edit your question rather than posting code as comments. It's quite difficult to read.

metadaddy (2019-09-09 14:06:55 -0500)