Select part of a sentence in a spreadsheet cell to define a condition that modifies the text of a tag in an xml file

Benjamin · November 22, 2024, 11:19am

I have a table with over 900 rows like this :

reference	software
ABD0H5D	software1, software2, software3
DPJ6R8G	software2, software3
GHI5P6M	NaN
AGH7U8N	software1, software3

I would like my code to read the ‘software’ column and if ‘software2’ (for example) is present in the cell then the tag text takes the value ‘Yes’ otherwise ‘No’ in my xml file whose title contains the same characters as the ‘reference’ column.

I tried something like :

import pandas as pd
import xml.etree.ElementTree as ET
import os

table1 = pd.read_excel('C:/Users/Documents/software.xlsx', na_values=['NA'], dtype=str)
table2 = table1.replace('\xa0', ' ',regex=True)

for root, dirs, files in os.walk("."):
    for file in files :
        if file[-4:] == '.xml' and file[:3] == 'LLL': 
            filePath2 = os.path.join(root, file)
            xml = ET.parse(filePath2)
            root = xml.getroot()
            nomenc1 = file[4:10]
            nomenc2 = file[4:11] 
            nomenc3= file[4:12] 
            nomenc4 = file[4:14]
            software_tag = ET.SubElement(root, "tagname")
            softw_excel = table2['software'][table2['reference'].isin([nomenc1, nomenc2, nomenc3, nomenc4])] 
            for softw in softw_excel :
                if softw.str.contains('software2') :
                    software_tag.text = 'Yes'
                else:
                    software_tag.text = 'No'
            ET.indent(root_produit)
            xml.write("infodump.xml", encoding='utf-8', xml_declaration=True, method='xml')

But it sends me back :

Traceback (most recent call last):

  Cell In[30], line 23
    if softw.str.contains('software2'):

AttributeError: 'float' object has no attribute 'str'

I also tried this :

syst_excel = table2['software'][table2['reference'].isin([nomenc1, nomenc2, nomenc3, nomenc4])]
S2 = syst_excel.str.contains('software2')
for values in S2 :
    if S2 == True :
        software_tag.text = 'Yes'
    else:
        software_tag.text = 'No'
ET.indent(root) 
xml.write("infodump.xml", encoding='utf-8', xml_declaration=True, method='xml')

I get :

Traceback (most recent call last):

  Cell In[31], line 26
    if S2 == True :

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I know how to match my ‘software’ column with my ‘reference’ column. My question is : How can I retrieve a specific character string from a sentence in a cell in order to either :

use it as a variable in a for, if, else statement, getting rid of the string or boolean format problem ?
if this isn’t possible, create a new column in my table where when the sentence contains my character string in ‘software’, this character string is added to the new column with the NaN cells remaining in NaN in the new column ? For the last one, I tried with pandas.DataFrame.where and pandas.Series.str.contains together but I’m still having problems with the string format. Furthermore, I’m not sure I understand what a ‘Series’ is in relation to a ‘Dataframe’…

Could you help me, please ?

Milan · November 22, 2024, 11:32am

The issue lies in how you’re trying to retrieve the value from the Excel data. The .where() function doesn’t filter the DataFrame in place but rather returns a Series with NaN for non-matching rows. To extract the matching value, you should use a more direct approach like .loc.

Here’s an updated version of your script with the fixes applied:

Updated Script

import pandas as pd
import xml.etree.ElementTree as ET
import os

# Load the Excel file into a DataFrame
table1 = pd.read_excel('C:/Users/Documents/datatypes.xlsx', na_values=['NA'])
table2 = table1.replace('\xa0', ' ', regex=True)

# Walk through the current directory and find XML files
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith('.xml'):
            # Parse the XML file
            xml = ET.parse(file)
            root = xml.getroot()

            # Extract the reference number from the file name
            reference_num = file[4:12]  # Extract characters after 'LLL_' (8 characters)

            # Find the row in the DataFrame that matches the reference number
            row = table2.loc[table2['reference'] == reference_num]
            if not row.empty:
                # Extract corresponding dtp1, dtp2, dtp3 values
                dtp1_excel = row['dtp1'].iloc[0]
                dtp2_excel = row['dtp2'].iloc[0]
                dtp3_excel = row['dtp3'].iloc[0]

                # Update the XML tags
                dtp1_xml = root.find("./dtp1")  # Replace 'dtp1' with the actual tag name
                dtp2_xml = root.find("./dtp2")  # Replace 'dtp2' with the actual tag name
                dtp3_xml = root.find("./dtp3")  # Replace 'dtp3' with the actual tag name

                if dtp1_xml is not None:
                    dtp1_xml.text = dtp1_excel
                if dtp2_xml is not None:
                    dtp2_xml.text = dtp2_excel
                if dtp3_xml is not None:
                    dtp3_xml.text = dtp3_excel

                # Save the updated XML file
                ET.indent(root)
                xml.write(file, encoding='utf-8', xml_declaration=True)
            else:
                print(f"No match found for {reference_num} in the Excel file.")

Explanation of Changes:

Extracting Rows Using loc:

The table2.loc[table2['reference'] == reference_num] is used to filter rows directly by the reference column. If a match exists, it will return the row(s); otherwise, it will be empty.

Accessing Values Using .iloc[0]:

Since row is a DataFrame, we extract the first (and only) row with .iloc[0] and access the desired column values.

Finding XML Tags:

Used root.find("./tag_name") to locate XML tags (dtp1, dtp2, dtp3). Replace tag_name with the actual XML tag names.

File Name Handling:

The script dynamically processes all XML files in the directory.

Error Handling:

If a reference is not found, it skips the file and prints a message.

Notes:

Make sure the dtp1, dtp2, and dtp3 tags exist in your XML files and their names match the ones used in root.find("./tag_name").
Adjust the file path for your Excel file and XML files as needed.
Test the script with a few files before running it on all files to ensure it behaves as expected.

Let me know if you need further assistance!